* [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS)
@ 2009-01-27  2:17 Andy Grover
From: Andy Grover @ 2009-01-27  2:17 UTC
  To: rdreier; +Cc: netdev, rds-devel, general


Hi Roland,

This patchset adds support for RDS as an Infiniband ULP. RDS is an
Oracle-originated protocol for sending IPC datagrams (up to 1MB)
reliably, and is currently used in Oracle RAC and Exadata products. It's
lived in OFED for 2+ years and I think it's time to get it upstream --
most likely into your -next tree for .30, but if it sneaks into .29 via
the "new code merge-window exception" then even better.

I've run checkpatch & sparse to clean up as many issues as possible, so
what remains are really the design peculiarities (aka warts) that arise
from a protocol designed by one company for a single critical
application. I think upstreaming this code is the first step towards
working out those issues and making the end result available to a wider
audience.
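
For anyone unfamiliar with the interface: from userspace, RDS looks like
an ordinary SOCK_SEQPACKET datagram socket that uses IPv4 addressing. A
minimal illustrative sketch (untested; error handling and close() elided,
and PF_RDS must be visible in the headers -- see patch 21):

	#include <stdint.h>
	#include <string.h>
	#include <sys/types.h>
	#include <sys/socket.h>
	#include <netinet/in.h>

	/* Bind a local RDS endpoint and send one datagram to a peer. */
	static ssize_t rds_send_one(in_addr_t laddr, in_addr_t faddr,
				    uint16_t port, const void *buf, size_t len)
	{
		struct sockaddr_in sin;
		int fd = socket(PF_RDS, SOCK_SEQPACKET, 0);

		if (fd < 0)
			return -1;

		memset(&sin, 0, sizeof(sin));
		sin.sin_family = AF_INET;
		sin.sin_addr.s_addr = laddr;	/* INADDR_ANY is rejected */
		sin.sin_port = htons(port);	/* 0 picks a free port */
		if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
			return -1;

		sin.sin_addr.s_addr = faddr;
		return sendto(fd, buf, len, 0,
			      (struct sockaddr *)&sin, sizeof(sin));
	}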

Also available for review at:
git://git.openfabrics.org/~agrover/ofed_1_4/linux-2.6 for-roland

Thoughts? Shortlog follows.

Thanks -- Regards -- Andy

Andy Grover (21):
      RDS: Socket interface
      RDS: Main header file
      RDS: Congestion-handling code
      RDS: Transport code
      RDS: Info and stats
      RDS: Connection handling
      RDS: loopback
      RDS: sysctls
      RDS: Message parsing
      RDS: send.c
      RDS: recv.c
      RDS: RDMA support
      RDS/IB: Infiniband transport
      RDS/IB: Ring-handling code.
      RDS/IB: Implement RDMA ops using FMRs
      RDS/IB: Implement IB-specific datagram send.
      RDS/IB: Receive datagrams via IB
      RDS/IB: Stats and sysctls
      RDS: Documentation
      RDS: Kconfig and Makefile
      RDS: Add AF and PF #defines for RDS sockets

 Documentation/networking/rds.txt        |  356 +++++++++++
 drivers/infiniband/Kconfig              |    2 +
 drivers/infiniband/Makefile             |    1 +
 drivers/infiniband/ulp/rds/Kconfig      |   13 +
 drivers/infiniband/ulp/rds/Makefile     |   13 +
 drivers/infiniband/ulp/rds/af_rds.c     |  677 +++++++++++++++++++++
 drivers/infiniband/ulp/rds/bind.c       |  202 +++++++
 drivers/infiniband/ulp/rds/cong.c       |  424 +++++++++++++
 drivers/infiniband/ulp/rds/connection.c |  501 +++++++++++++++
 drivers/infiniband/ulp/rds/ib.c         |  312 ++++++++++
 drivers/infiniband/ulp/rds/ib.h         |  358 +++++++++++
 drivers/infiniband/ulp/rds/ib_cm.c      |  882 +++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/ib_rdma.c    |  641 ++++++++++++++++++++
 drivers/infiniband/ulp/rds/ib_recv.c    |  894 +++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/ib_ring.c    |  168 +++++
 drivers/infiniband/ulp/rds/ib_send.c    |  852 ++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/ib_stats.c   |   95 +++
 drivers/infiniband/ulp/rds/ib_sysctl.c  |  137 +++++
 drivers/infiniband/ulp/rds/info.c       |  243 ++++++++
 drivers/infiniband/ulp/rds/info.h       |   43 ++
 drivers/infiniband/ulp/rds/loop.c       |  189 ++++++
 drivers/infiniband/ulp/rds/loop.h       |    9 +
 drivers/infiniband/ulp/rds/message.c    |  414 +++++++++++++
 drivers/infiniband/ulp/rds/page.c       |  222 +++++++
 drivers/infiniband/ulp/rds/rdma.c       |  682 +++++++++++++++++++++
 drivers/infiniband/ulp/rds/rdma.h       |   84 +++
 drivers/infiniband/ulp/rds/rds.h        |  763 +++++++++++++++++++++++
 drivers/infiniband/ulp/rds/rds_rdma.h   |  245 ++++++++
 drivers/infiniband/ulp/rds/recv.c       |  550 +++++++++++++++++
 drivers/infiniband/ulp/rds/send.c       | 1006 +++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/stats.c      |  150 +++++
 drivers/infiniband/ulp/rds/sysctl.c     |  164 +++++
 drivers/infiniband/ulp/rds/threads.c    |  273 +++++++++
 drivers/infiniband/ulp/rds/transport.c  |  134 ++++
 include/linux/socket.h                  |    4 +-
 35 files changed, 11702 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/networking/rds.txt
 create mode 100644 drivers/infiniband/ulp/rds/Kconfig
 create mode 100644 drivers/infiniband/ulp/rds/Makefile
 create mode 100644 drivers/infiniband/ulp/rds/af_rds.c
 create mode 100644 drivers/infiniband/ulp/rds/bind.c
 create mode 100644 drivers/infiniband/ulp/rds/cong.c
 create mode 100644 drivers/infiniband/ulp/rds/connection.c
 create mode 100644 drivers/infiniband/ulp/rds/ib.c
 create mode 100644 drivers/infiniband/ulp/rds/ib.h
 create mode 100644 drivers/infiniband/ulp/rds/ib_cm.c
 create mode 100644 drivers/infiniband/ulp/rds/ib_rdma.c
 create mode 100644 drivers/infiniband/ulp/rds/ib_recv.c
 create mode 100644 drivers/infiniband/ulp/rds/ib_ring.c
 create mode 100644 drivers/infiniband/ulp/rds/ib_send.c
 create mode 100644 drivers/infiniband/ulp/rds/ib_stats.c
 create mode 100644 drivers/infiniband/ulp/rds/ib_sysctl.c
 create mode 100644 drivers/infiniband/ulp/rds/info.c
 create mode 100644 drivers/infiniband/ulp/rds/info.h
 create mode 100644 drivers/infiniband/ulp/rds/loop.c
 create mode 100644 drivers/infiniband/ulp/rds/loop.h
 create mode 100644 drivers/infiniband/ulp/rds/message.c
 create mode 100644 drivers/infiniband/ulp/rds/page.c
 create mode 100644 drivers/infiniband/ulp/rds/rdma.c
 create mode 100644 drivers/infiniband/ulp/rds/rdma.h
 create mode 100644 drivers/infiniband/ulp/rds/rds.h
 create mode 100644 drivers/infiniband/ulp/rds/rds_rdma.h
 create mode 100644 drivers/infiniband/ulp/rds/recv.c
 create mode 100644 drivers/infiniband/ulp/rds/send.c
 create mode 100644 drivers/infiniband/ulp/rds/stats.c
 create mode 100644 drivers/infiniband/ulp/rds/sysctl.c
 create mode 100644 drivers/infiniband/ulp/rds/threads.c
 create mode 100644 drivers/infiniband/ulp/rds/transport.c

end


* [ofa-general] [PATCH 01/21] RDS: Socket interface
From: Andy Grover @ 2009-01-27  2:17 UTC
  To: rdreier; +Cc: netdev, rds-devel, general

Implement the RDS (Reliable Datagram Sockets) interface.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/af_rds.c |  677 +++++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/bind.c   |  202 +++++++++++
 2 files changed, 879 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/af_rds.c
 create mode 100644 drivers/infiniband/ulp/rds/bind.c

diff --git a/drivers/infiniband/ulp/rds/af_rds.c b/drivers/infiniband/ulp/rds/af_rds.c
new file mode 100644
index 0000000..7158438
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/af_rds.c
@@ -0,0 +1,677 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/version.h>
+#include <net/sock.h>
+
+#include "rds.h"
+#include "rdma.h"
+
+static int enable_rdma = 1;
+
+module_param(enable_rdma, int, 0444);
+MODULE_PARM_DESC(enable_rdma, " Enable RDMA operations support");
+
+/* this is just used for stats gathering :/ */
+static DEFINE_SPINLOCK(rds_sock_lock);
+static unsigned long rds_sock_count;
+static LIST_HEAD(rds_sock_list);
+DECLARE_WAIT_QUEUE_HEAD(rds_poll_waitq);
+
+/*
+ * This is called as the final descriptor referencing this socket is closed.
+ * We have to unbind the socket so that another socket can be bound to the
+ * address it was using.
+ *
+ * We have to be careful about racing with the incoming path.  sock_orphan()
+ * sets SOCK_DEAD and we use that as an indicator to the rx path that new
+ * messages shouldn't be queued.
+ */
+static int rds_release(struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct rds_sock *rs;
+	unsigned long flags;
+
+	if (sk == NULL)
+		goto out;
+
+	rs = rds_sk_to_rs(sk);
+
+	sock_orphan(sk);
+	/* Note - rds_clear_recv_queue grabs rs_recv_lock, so
+	 * that ensures the recv path has completed messing
+	 * with the socket. */
+	rds_clear_recv_queue(rs);
+	rds_cong_remove_socket(rs);
+	rds_remove_bound(rs);
+	rds_send_drop_to(rs, NULL);
+	rds_rdma_drop_keys(rs);
+	rds_notify_queue_get(rs, NULL);
+
+	spin_lock_irqsave(&rds_sock_lock, flags);
+	list_del_init(&rs->rs_item);
+	rds_sock_count--;
+	spin_unlock_irqrestore(&rds_sock_lock, flags);
+
+	sock->sk = NULL;
+	sock_put(sk);
+out:
+	return 0;
+}
+
+/*
+ * Careful not to race with rds_release -> sock_orphan which clears sk_sleep.
+ * _bh() isn't OK here, we're called from interrupt handlers.  It's probably OK
+ * to wake the waitqueue after sk_sleep is clear as we hold a sock ref, but
+ * this seems more conservative.
+ * NB - normally, one would use sk_callback_lock for this, but we can
+ * get here from interrupts, whereas the network code grabs sk_callback_lock
+ * with _lock_bh only - so relying on sk_callback_lock introduces livelocks.
+ */
+void rds_wake_sk_sleep(struct rds_sock *rs)
+{
+	unsigned long flags;
+
+	read_lock_irqsave(&rs->rs_recv_lock, flags);
+	__rds_wake_sk_sleep(rds_rs_to_sk(rs));
+	read_unlock_irqrestore(&rs->rs_recv_lock, flags);
+}
+
+static int rds_getname(struct socket *sock, struct sockaddr *uaddr,
+		       int *uaddr_len, int peer)
+{
+	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
+	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
+
+	memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+
+	/* racy, don't care */
+	if (peer) {
+		if (!rs->rs_conn_addr)
+			return -ENOTCONN;
+
+		sin->sin_port = rs->rs_conn_port;
+		sin->sin_addr.s_addr = rs->rs_conn_addr;
+	} else {
+		sin->sin_port = rs->rs_bound_port;
+		sin->sin_addr.s_addr = rs->rs_bound_addr;
+	}
+
+	sin->sin_family = AF_INET;
+
+	*uaddr_len = sizeof(*sin);
+	return 0;
+}
+
+/*
+ * RDS' poll is without a doubt the least intuitive part of the interface,
+ * as POLLIN and POLLOUT do not behave entirely as you would expect from
+ * a network protocol.
+ *
+ * POLLIN is asserted if
+ *  -	there is data on the receive queue, or
+ *  -	a previously congested destination may have become uncongested, or
+ *  -	a notification has been queued to the socket (this can be a congestion
+ *	update, or an RDMA completion).
+ *
+ * POLLOUT is asserted if there is room on the send queue. This does not mean
+ * however, that the next sendmsg() call will succeed. If the application tries
+ * to send to a congested destination, the system call may still fail (and
+ * return ENOBUFS).
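+ *
+ * An illustrative (hypothetical) userspace pattern:
+ *
+ *	struct pollfd pfd = { .fd = rds_fd, .events = POLLIN };
+ *
+ *	poll(&pfd, 1, -1);
+ *	if (pfd.revents & POLLIN) {
+ *		drain_recv_queue(rds_fd);
+ *		retry_blocked_sends(rds_fd);
+ *	}
+ *
+ * where the two helpers are application code: since POLLIN may only mean
+ * that a congestion map update arrived, a robust application both drains
+ * its receive queue and retries sends that previously failed with ENOBUFS.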
+ */
+static unsigned int rds_poll(struct file *file, struct socket *sock,
+			     poll_table *wait)
+{
+	struct sock *sk = sock->sk;
+	struct rds_sock *rs = rds_sk_to_rs(sk);
+	unsigned int mask = 0;
+	unsigned long flags;
+
+	poll_wait(file, sk->sk_sleep, wait);
+
+	poll_wait(file, &rds_poll_waitq, wait);
+
+	read_lock_irqsave(&rs->rs_recv_lock, flags);
+	if (!rs->rs_cong_monitor) {
+		/* When a congestion map was updated, we signal POLLIN for
+		 * "historical" reasons. Applications can also poll for
+		 * WRBAND instead. */
+		if (rds_cong_updated_since(&rs->rs_cong_track))
+			mask |= (POLLIN | POLLRDNORM | POLLWRBAND);
+	} else {
+		spin_lock(&rs->rs_lock);
+		if (rs->rs_cong_notify)
+			mask |= (POLLIN | POLLRDNORM);
+		spin_unlock(&rs->rs_lock);
+	}
+	if (!list_empty(&rs->rs_recv_queue) ||
+	    !list_empty(&rs->rs_notify_queue))
+		mask |= (POLLIN | POLLRDNORM);
+	if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
+		mask |= (POLLOUT | POLLWRNORM);
+	read_unlock_irqrestore(&rs->rs_recv_lock, flags);
+
+	return mask;
+}
+
+static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
+{
+#ifdef KERNEL_HAS_CORE_CALLING_DEV_IOCTL
+	return -ENOIOCTLCMD;
+#endif /* KERNEL_HAS_CORE_CALLING_DEV_IOCTL */
+#ifndef KERNEL_HAS_CORE_CALLING_DEV_IOCTL
+	return dev_ioctl(cmd, (void __user *)arg);
+#endif /* KERNEL_HAS_CORE_CALLING_DEV_IOCTL */
+}
+
+static int rds_cancel_sent_to(struct rds_sock *rs, char __user *optval,
+			      int len)
+{
+	struct sockaddr_in sin;
+	int ret = 0;
+
+	/* racing with another thread binding seems ok here */
+	if (rs->rs_bound_addr == 0) {
+		ret = -ENOTCONN; /* XXX not a great errno */
+		goto out;
+	}
+
+	if (len < sizeof(struct sockaddr_in)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (copy_from_user(&sin, optval, sizeof(sin))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	rds_send_drop_to(rs, &sin);
+out:
+	return ret;
+}
+
+static int rds_set_bool_option(unsigned char *optvar, char __user *optval,
+			       int optlen)
+{
+	int value;
+
+	if (optlen < sizeof(int))
+		return -EINVAL;
+	if (get_user(value, (int __user *) optval))
+		return -EFAULT;
+	*optvar = !!value;
+	return 0;
+}
+
+static int rds_cong_monitor(struct rds_sock *rs, char __user *optval,
+			    int optlen)
+{
+	int ret;
+
+	ret = rds_set_bool_option(&rs->rs_cong_monitor, optval, optlen);
+	if (ret == 0) {
+		if (rs->rs_cong_monitor) {
+			rds_cong_add_socket(rs);
+		} else {
+			rds_cong_remove_socket(rs);
+			rs->rs_cong_mask = 0;
+			rs->rs_cong_notify = 0;
+		}
+	}
+	return ret;
+}
+
+static int rds_setsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int optlen)
+{
+	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
+	int ret;
+
+	if (level != SOL_RDS) {
+		ret = -ENOPROTOOPT;
+		goto out;
+	}
+
+	switch (optname) {
+	case RDS_CANCEL_SENT_TO:
+		ret = rds_cancel_sent_to(rs, optval, optlen);
+		break;
+	case RDS_GET_MR:
+		if (enable_rdma)
+			ret = rds_get_mr(rs, optval, optlen);
+		else
+			ret = -EOPNOTSUPP;
+		break;
+	case RDS_FREE_MR:
+		if (enable_rdma)
+			ret = rds_free_mr(rs, optval, optlen);
+		else
+			ret = -EOPNOTSUPP;
+		break;
+	case RDS_RECVERR:
+		ret = rds_set_bool_option(&rs->rs_recverr, optval, optlen);
+		break;
+	case RDS_CONG_MONITOR:
+		ret = rds_cong_monitor(rs, optval, optlen);
+		break;
+	default:
+		ret = -ENOPROTOOPT;
+	}
+out:
+	return ret;
+}
+
+static int rds_getsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int __user *optlen)
+{
+	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
+	int ret = -ENOPROTOOPT, len;
+
+	if (level != SOL_RDS)
+		goto out;
+
+	if (get_user(len, optlen)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	switch (optname) {
+	case RDS_INFO_FIRST ... RDS_INFO_LAST:
+		ret = rds_info_getsockopt(sock, optname, optval,
+					  optlen);
+		break;
+
+	case RDS_RECVERR:
+		if (len < sizeof(int))
+			ret = -EINVAL;
+		else if (put_user(rs->rs_recverr, (int __user *) optval) ||
+			 put_user(sizeof(int), optlen))
+			ret = -EFAULT;
+		else
+			ret = 0;
+		break;
+	default:
+		break;
+	}
+
+out:
+	return ret;
+}
+
+static int rds_connect(struct socket *sock, struct sockaddr *uaddr,
+		       int addr_len, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
+	struct rds_sock *rs = rds_sk_to_rs(sk);
+	int ret = 0;
+
+	lock_sock(sk);
+
+	if (addr_len != sizeof(struct sockaddr_in)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (sin->sin_family != AF_INET) {
+		ret = -EAFNOSUPPORT;
+		goto out;
+	}
+
+	if (sin->sin_addr.s_addr == htonl(INADDR_ANY)) {
+		ret = -EDESTADDRREQ;
+		goto out;
+	}
+
+	rs->rs_conn_addr = sin->sin_addr.s_addr;
+	rs->rs_conn_port = sin->sin_port;
+
+out:
+	release_sock(sk);
+	return ret;
+}
+
+#ifdef KERNEL_HAS_PROTO_REGISTER
+static struct proto rds_proto = {
+	.name	  = "RDS",
+	.owner	  = THIS_MODULE,
+	.obj_size = sizeof(struct rds_sock),
+};
+#endif /* KERNEL_HAS_PROTO_REGISTER */
+
+static struct proto_ops rds_proto_ops = {
+	.family =	AF_RDS,
+	.owner =	THIS_MODULE,
+	.release =	rds_release,
+	.bind =		rds_bind,
+	.connect =	rds_connect,
+	.socketpair =	sock_no_socketpair,
+	.accept =	sock_no_accept,
+	.getname =	rds_getname,
+	.poll =		rds_poll,
+	.ioctl =	rds_ioctl,
+	.listen =	sock_no_listen,
+	.shutdown =	sock_no_shutdown,
+	.setsockopt =	rds_setsockopt,
+	.getsockopt =	rds_getsockopt,
+	.sendmsg =	rds_sendmsg,
+	.recvmsg =	rds_recvmsg,
+	.mmap =		sock_no_mmap,
+	.sendpage =	sock_no_sendpage,
+};
+
+#ifndef KERNEL_HAS_PROTO_REGISTER
+static struct sock *sk_alloc_compat(int pf, gfp_t gfp, struct proto *prot,
+				    int zero_it)
+{
+	struct sock *sk;
+	struct rds_sock *rs;
+
+	sk = sk_alloc(pf, gfp, prot, zero_it);
+	if (sk == NULL)
+		return NULL;
+
+	rs = kcalloc(1, sizeof(struct rds_sock), GFP_ATOMIC);
+	if (rs == NULL) {
+		sk_free(sk);
+		return NULL;
+	}
+
+	/* sock_def_destruct frees this for us */
+	sk->sk_protinfo = rs;
+	rs->rs_sk = sk;
+
+	return sk;
+}
+
+#undef sk_alloc
+#define sk_alloc sk_alloc_compat
+#endif /* KERNEL_HAS_PROTO_REGISTER */
+
+static int __rds_create(struct socket *sock, struct sock *sk, int protocol)
+{
+	unsigned long flags;
+	struct rds_sock *rs;
+
+	sock_init_data(sock, sk);
+#ifndef KERNEL_HAS_PROTO_REGISTER
+	/* Can this be moved to sk_alloc_compat? */
+	sk_set_owner(sk, THIS_MODULE);
+#endif /* KERNEL_HAS_PROTO_REGISTER */
+	sock->ops		= &rds_proto_ops;
+	sk->sk_protocol		= protocol;
+
+	rs = rds_sk_to_rs(sk);
+	spin_lock_init(&rs->rs_lock);
+	rwlock_init(&rs->rs_recv_lock);
+	INIT_LIST_HEAD(&rs->rs_send_queue);
+	INIT_LIST_HEAD(&rs->rs_recv_queue);
+	INIT_LIST_HEAD(&rs->rs_notify_queue);
+	INIT_LIST_HEAD(&rs->rs_cong_list);
+	spin_lock_init(&rs->rs_rdma_lock);
+	rs->rs_rdma_keys = RB_ROOT;
+
+	spin_lock_irqsave(&rds_sock_lock, flags);
+	list_add_tail(&rs->rs_item, &rds_sock_list);
+	rds_sock_count++;
+	spin_unlock_irqrestore(&rds_sock_lock, flags);
+
+	return 0;
+}
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)
+static int rds_create(struct socket *sock, int protocol)
+{
+	struct sock *sk;
+
+	if (sock->type != SOCK_SEQPACKET || protocol)
+		return -ESOCKTNOSUPPORT;
+
+	sk = sk_alloc(AF_RDS, GFP_ATOMIC, &rds_proto, 1);
+	if (sk == NULL)
+		return -ENOMEM;
+
+	return __rds_create(sock, sk, protocol);
+}
+#else
+static int rds_create(struct net *net, struct socket *sock, int protocol)
+{
+	struct sock *sk;
+
+	if (sock->type != SOCK_SEQPACKET || protocol)
+		return -ESOCKTNOSUPPORT;
+
+	sk = sk_alloc(net, AF_RDS, GFP_ATOMIC, &rds_proto);
+	if (!sk)
+		return -ENOMEM;
+
+	return __rds_create(sock, sk, protocol);
+}
+#endif
+
+void rds_sock_addref(struct rds_sock *rs)
+{
+	sock_hold(rds_rs_to_sk(rs));
+}
+
+void rds_sock_put(struct rds_sock *rs)
+{
+	sock_put(rds_rs_to_sk(rs));
+}
+
+static struct net_proto_family rds_family_ops = {
+	.family =	AF_RDS,
+	.create =	rds_create,
+	.owner	=	THIS_MODULE,
+};
+
+static void rds_sock_inc_info(struct socket *sock, unsigned int len,
+			      struct rds_info_iterator *iter,
+			      struct rds_info_lengths *lens)
+{
+	struct rds_sock *rs;
+	struct sock *sk;
+	struct rds_incoming *inc;
+	unsigned long flags;
+	unsigned int total = 0;
+
+	len /= sizeof(struct rds_info_message);
+
+	spin_lock_irqsave(&rds_sock_lock, flags);
+
+	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		sk = rds_rs_to_sk(rs);
+		read_lock(&rs->rs_recv_lock);
+
+		/* XXX too lazy to maintain counts.. */
+		list_for_each_entry(inc, &rs->rs_recv_queue, i_item) {
+			total++;
+			if (total <= len)
+				rds_inc_info_copy(inc, iter, inc->i_saddr,
+						  rs->rs_bound_addr, 1);
+		}
+
+		read_unlock(&rs->rs_recv_lock);
+	}
+
+	spin_unlock_irqrestore(&rds_sock_lock, flags);
+
+	lens->nr = total;
+	lens->each = sizeof(struct rds_info_message);
+}
+
+static void rds_sock_info(struct socket *sock, unsigned int len,
+			  struct rds_info_iterator *iter,
+			  struct rds_info_lengths *lens)
+{
+	struct rds_info_socket sinfo;
+	struct rds_sock *rs;
+	unsigned long flags;
+
+	len /= sizeof(struct rds_info_socket);
+
+	spin_lock_irqsave(&rds_sock_lock, flags);
+
+	if (len < rds_sock_count)
+		goto out;
+
+	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		sinfo.sndbuf = rds_sk_sndbuf(rs);
+		sinfo.rcvbuf = rds_sk_rcvbuf(rs);
+		sinfo.bound_addr = rs->rs_bound_addr;
+		sinfo.connected_addr = rs->rs_conn_addr;
+		sinfo.bound_port = rs->rs_bound_port;
+		sinfo.connected_port = rs->rs_conn_port;
+		sinfo.inum = sock_i_ino(rds_rs_to_sk(rs));
+
+		rds_info_copy(iter, &sinfo, sizeof(sinfo));
+	}
+
+out:
+	lens->nr = rds_sock_count;
+	lens->each = sizeof(struct rds_info_socket);
+
+	spin_unlock_irqrestore(&rds_sock_lock, flags);
+}
+
+/*
+ * The order is important here.
+ *
+ * rds_trans_stop_listening() is called before conn_exit so new connections
+ * don't hit while existing ones are being torn down.
+ *
+ * rds_conn_exit() is before rds_trans_exit() as rds_conn_exit() calls into the
+ * transports to free connections and incoming fragments as they're torn down.
+ */
+static void __exit rds_exit(void)
+{
+	rds_ib_exit();
+	sock_unregister(rds_family_ops.family);
+#ifdef KERNEL_HAS_PROTO_REGISTER
+	proto_unregister(&rds_proto);
+#endif /* KERNEL_HAS_PROTO_REGISTER */
+	rds_trans_stop_listening();
+	rds_conn_exit();
+	rds_cong_exit();
+	rds_sysctl_exit();
+	rds_threads_exit();
+	rds_stats_exit();
+	rds_page_exit();
+	rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info);
+	rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
+}
+module_exit(rds_exit);
+
+static int __init rds_init(void)
+{
+	int ret;
+
+#ifndef KERNEL_HAS_NOT_DEFINED
+	/* the strange ifdef above has scripts/makepatch.sh strip this out */
+#if PF_RDS == 21
+	printk(KERN_ERR "!!! This build of RDS is using PF 21 which is not "
+	       "reserved\n");
+	printk(KERN_ERR "!!! This is only suitable for testing, DO NOT "
+	       "RELEASE THIS.\n");
+	printk(KERN_ERR "!!! No, seriously.\n");
+#endif
+#endif /* KERNEL_HAS_NOT_DEFINED */
+
+	ret = rds_conn_init();
+	if (ret)
+		goto out;
+	ret = rds_threads_init();
+	if (ret)
+		goto out_conn;
+	ret = rds_sysctl_init();
+	if (ret)
+		goto out_threads;
+	ret = rds_stats_init();
+	if (ret)
+		goto out_sysctl;
+#ifdef KERNEL_HAS_PROTO_REGISTER
+	ret = proto_register(&rds_proto, 1);
+	if (ret)
+		goto out_stats;
+#endif /* KERNEL_HAS_PROTO_REGISTER */
+	ret = sock_register(&rds_family_ops);
+	if (ret)
+		goto out_proto;
+
+	rds_info_register_func(RDS_INFO_SOCKETS, rds_sock_info);
+	rds_info_register_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
+
+	ret = rds_ib_init();
+	if (ret)
+		goto out_sock;
+	goto out;
+
+out_sock:
+	sock_unregister(rds_family_ops.family);
+out_proto:
+#ifdef KERNEL_HAS_PROTO_REGISTER
+	proto_unregister(&rds_proto);
+out_stats:
+#endif /* KERNEL_HAS_PROTO_REGISTER */
+	rds_stats_exit();
+out_sysctl:
+	rds_sysctl_exit();
+out_threads:
+	rds_threads_exit();
+out_conn:
+	rds_conn_exit();
+	rds_cong_exit();
+	rds_page_exit();
+out:
+	return ret;
+}
+module_init(rds_init);
+
+#define DRV_VERSION     "4.0"
+#define DRV_RELDATE     "July 28, 2008"
+
+MODULE_AUTHOR("Zach Brown");
+MODULE_AUTHOR("Olaf Kirch");
+MODULE_DESCRIPTION("RDS: Reliable Datagram Sockets"
+		   " v" DRV_VERSION " (" DRV_RELDATE ")");
+MODULE_VERSION(DRV_VERSION);
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_ALIAS_NETPROTO(PF_RDS);
diff --git a/drivers/infiniband/ulp/rds/bind.c b/drivers/infiniband/ulp/rds/bind.c
new file mode 100644
index 0000000..0f625d0
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/bind.c
@@ -0,0 +1,202 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <net/sock.h>
+#include <linux/in.h>
+#include <linux/if_arp.h>
+#include "rds.h"
+
+/*
+ * XXX this probably still needs more work.. no INADDR_ANY, and rbtrees aren't
+ * particularly zippy.
+ *
+ * This is now called for every incoming frame so we arguably care much more
+ * about it than we used to.
+ */
+static DEFINE_SPINLOCK(rds_bind_lock);
+static struct rb_root rds_bind_tree = RB_ROOT;
+
+static struct rds_sock *rds_bind_tree_walk(__be32 addr, __be16 port,
+					   struct rds_sock *insert)
+{
+	struct rb_node **p = &rds_bind_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct rds_sock *rs;
+	u64 cmp;
+	u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
+
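+	/* Keys pack (addr << 32) | port into a u64, so the tree is ordered
+	 * by address first, then port. */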
+	while (*p) {
+		parent = *p;
+		rs = rb_entry(parent, struct rds_sock, rs_bound_node);
+
+		cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
+		      be16_to_cpu(rs->rs_bound_port);
+
+		if (needle < cmp)
+			p = &(*p)->rb_left;
+		else if (needle > cmp)
+			p = &(*p)->rb_right;
+		else
+			return rs;
+	}
+
+	if (insert) {
+		rb_link_node(&insert->rs_bound_node, parent, p);
+		rb_insert_color(&insert->rs_bound_node, &rds_bind_tree);
+	}
+	return NULL;
+}
+
+/*
+ * Return the rds_sock bound at the given local address.
+ *
+ * The rx path can race with rds_release.  We notice if rds_release() has
+ * marked this socket and don't return a rs ref to the rx path.
+ */
+struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
+{
+	struct rds_sock *rs;
+	unsigned long flags;
+
+	spin_lock_irqsave(&rds_bind_lock, flags);
+	rs = rds_bind_tree_walk(addr, port, NULL);
+	if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
+		rds_sock_addref(rs);
+	else
+		rs = NULL;
+	spin_unlock_irqrestore(&rds_bind_lock, flags);
+
+	rdsdebug("returning rs %p for %u.%u.%u.%u:%u\n", rs, NIPQUAD(addr),
+		ntohs(port));
+	return rs;
+}
+
+/* returns 0 on success (with *port set to the bound port) or -ve errno */
+static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
+{
+	unsigned long flags;
+	int ret = -EADDRINUSE;
+	u16 rover, last;
+
+	if (*port != 0) {
+		rover = be16_to_cpu(*port);
+		last = rover;
+	} else {
+		rover = max_t(u16, net_random(), 2);
+		last = rover - 1;
+	}
+
+	spin_lock_irqsave(&rds_bind_lock, flags);
+
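+	/* Try each candidate port starting at rover and wrapping past 0
+	 * until we come back around to last (a single attempt when the
+	 * caller asked for a specific port); the first port with no entry
+	 * in the bind tree wins. */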
+	do {
+		if (rover == 0)
+			rover++;
+		if (rds_bind_tree_walk(addr, cpu_to_be16(rover), rs) == NULL) {
+			*port = cpu_to_be16(rover);
+			ret = 0;
+			break;
+		}
+	} while (rover++ != last);
+
+	if (ret == 0) {
+		rs->rs_bound_addr = addr;
+		rs->rs_bound_port = *port;
+		rds_sock_addref(rs);
+
+		rdstrace(RDS_BIND, RDS_LOW,
+			 "rs %p binding to %u.%u.%u.%u:%d\n",
+			 rs, NIPQUAD(addr), (int)ntohs(*port));
+	}
+
+	spin_unlock_irqrestore(&rds_bind_lock, flags);
+
+	return ret;
+}
+
+void rds_remove_bound(struct rds_sock *rs)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&rds_bind_lock, flags);
+
+	if (rs->rs_bound_addr) {
+		rdstrace(RDS_BIND, RDS_LOW,
+			 "rs %p unbinding from %u.%u.%u.%u:%d\n",
+			 rs, NIPQUAD(rs->rs_bound_addr),
+			 (int)ntohs(rs->rs_bound_port));
+
+		rb_erase(&rs->rs_bound_node, &rds_bind_tree);
+		rds_sock_put(rs);
+		rs->rs_bound_addr = 0;
+	}
+
+	spin_unlock_irqrestore(&rds_bind_lock, flags);
+}
+
+int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
+{
+	struct sock *sk = sock->sk;
+	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
+	struct rds_sock *rs = rds_sk_to_rs(sk);
+	struct rds_transport *trans;
+	int ret = 0;
+
+	lock_sock(sk);
+
+	if (addr_len != sizeof(struct sockaddr_in) ||
+	    sin->sin_family != AF_INET ||
+	    rs->rs_bound_addr ||
+	    sin->sin_addr.s_addr == htonl(INADDR_ANY)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = rds_add_bound(rs, sin->sin_addr.s_addr, &sin->sin_port);
+	if (ret)
+		goto out;
+
+	trans = rds_trans_get_preferred(sin->sin_addr.s_addr);
+	if (trans == NULL) {
+		ret = -EADDRNOTAVAIL;
+		rds_remove_bound(rs);
+		goto out;
+	}
+
+	rs->rs_transport = trans;
+	ret = 0;
+
+out:
+	release_sock(sk);
+	return ret;
+}
-- 
1.5.6.3


* [ofa-general] [PATCH 02/21] RDS: Main header file
From: Andy Grover @ 2009-01-27  2:17 UTC
  To: rdreier; +Cc: netdev, rds-devel, general

RDS's main data structure definitions and exported functions.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/rds.h |  763 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 763 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/rds.h

diff --git a/drivers/infiniband/ulp/rds/rds.h b/drivers/infiniband/ulp/rds/rds.h
new file mode 100644
index 0000000..133a237
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/rds.h
@@ -0,0 +1,763 @@
+#ifndef _RDS_H
+#define _RDS_H
+
+#include <net/sock.h>
+#include <linux/scatterlist.h>
+#include <asm/atomic.h>
+
+#include <linux/mutex.h>
+#include "rds_rdma.h"
+
+/*
+ * RDS Network protocol version
+ */
+#define RDS_PROTOCOL_3_0	0x0300
+#define RDS_PROTOCOL_3_1	0x0301
+#define RDS_PROTOCOL_VERSION	RDS_PROTOCOL_3_1
+#define RDS_PROTOCOL_MAJOR(v)	((v) >> 8)
+#define RDS_PROTOCOL_MINOR(v)	((v) & 255)
+#define RDS_PROTOCOL(maj, min)	(((maj) << 8) | (min))
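+/* e.g. RDS_PROTOCOL(3, 1) == 0x0301 == RDS_PROTOCOL_3_1 */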
+
+/*
+ * XXX randomly chosen, but at least seems to be unused:
+ * #               18464-18768 Unassigned
+ * We should do better.  We want a reserved port to discourage unpriv'ed
+ * userspace from listening.
+ */
+#define RDS_PORT	18634
+
+#ifndef AF_RDS
+#define AF_RDS          28      /* Reliable Datagram Socket     */
+#endif
+
+#ifndef PF_RDS
+#define PF_RDS          AF_RDS
+#endif
+
+#ifndef SOL_RDS
+#define SOL_RDS         272
+#endif
+
+#define KERNEL_HAS_PROTO_REGISTER 1
+#define KERNEL_HAS_INET_SK_RETURNING_INET_SOCK 1
+#define KERNEL_HAS_CORE_CALLING_DEV_IOCTL 1
+
+#ifdef ATOMIC64_INIT
+#define KERNEL_HAS_ATOMIC64
+#endif
+
+/* x86-64 doesn't include kmap_types.h from anywhere */
+#include <asm/kmap_types.h>
+#include <linux/highmem.h>
+
+#include "info.h"
+
+#ifdef DEBUG
+#define rdsdebug(fmt, args...) pr_debug("%s(): " fmt, __func__ , ##args)
+#else
+/* sigh, pr_debug() causes unused variable warnings */
+static inline void __attribute__ ((format (printf, 1, 2)))
+rdsdebug(char *fmt, ...)
+{
+}
+#endif
+
+/*
+ * RDS trace facilities
+ */
+enum {
+	RDS_BIND = 0,
+	RDS_CONG,
+	RDS_CONNECTION,
+	RDS_RDMA,
+	RDS_PAGE,
+	RDS_SEND,
+	RDS_RECV,
+	RDS_THREADS,
+	RDS_INFO,
+	RDS_MESSAGE,
+	RDS_IB,
+	RDS_IB_CM,
+	RDS_IB_RDMA,
+	RDS_IB_RING,
+	RDS_IB_RECV,
+	RDS_IB_SEND,
+	RDS_TCP,
+	RDS_TCP_CONNECT,
+	RDS_TCP_LISTEN,
+	RDS_TCP_RECV,
+	RDS_TCP_SEND
+};
+
+enum {
+	RDS_ALWAYS = 0,
+	RDS_MINIMAL,
+	RDS_LOW,
+	RDS_MEDIUM,
+	RDS_HIGH,
+	RDS_VERBOSE
+};
+
+
+#define rdstrace(fac, lvl, fmt, args...) do {			\
+	if (test_bit(fac, &rds_sysctl_trace_flags) &&		\
+	    lvl <= rds_sysctl_trace_level)			\
+		printk("%s(): " fmt, __func__, ##args);		\
+} while (0)
+
+/* XXX is there one of these somewhere? */
+#define ceil(x, y) \
+	({ unsigned long __x = (x), __y = (y); (__x + __y - 1) / __y; })
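+/* (same computation as DIV_ROUND_UP() in linux/kernel.h) */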
+
+#define RDS_FRAG_SHIFT	12
+#define RDS_FRAG_SIZE	((unsigned int)(1 << RDS_FRAG_SHIFT))
+
+#define RDS_CONG_MAP_BYTES	(65536 / 8)
+#define RDS_CONG_MAP_LONGS	(RDS_CONG_MAP_BYTES / sizeof(unsigned long))
+#define RDS_CONG_MAP_PAGES	(PAGE_ALIGN(RDS_CONG_MAP_BYTES) / PAGE_SIZE)
+#define RDS_CONG_MAP_PAGE_BITS	(PAGE_SIZE * 8)
+
+struct rds_cong_map {
+	struct rb_node		m_rb_node;
+	__be32			m_addr;
+	wait_queue_head_t	m_waitq;
+	struct list_head	m_conn_list;
+	unsigned long		m_page_addrs[RDS_CONG_MAP_PAGES];
+};
+
+
+/*
+ * This is how we will track the connection state:
+ * A connection is always in one of the following
+ * states. Updates to the state are atomic and imply
+ * a memory barrier.
+ */
+enum {
+	RDS_CONN_DOWN = 0,
+	RDS_CONN_CONNECTING,
+	RDS_CONN_DISCONNECTING,
+	RDS_CONN_UP,
+	RDS_CONN_ERROR,
+};
+
+/* Bits for c_flags */
+#define RDS_LL_SEND_FULL	0
+#define RDS_RECONNECT_PENDING	1
+
+struct rds_connection {
+	struct hlist_node	c_hash_node;
+	__be32			c_laddr;
+	__be32			c_faddr;
+	unsigned int		c_loopback:1;
+	struct rds_connection	*c_passive;
+
+	struct rds_cong_map	*c_lcong;
+	struct rds_cong_map	*c_fcong;
+
+	struct mutex		c_send_lock;	/* protect send ring */
+	struct rds_message	*c_xmit_rm;
+	unsigned long		c_xmit_sg;
+	unsigned int		c_xmit_hdr_off;
+	unsigned int		c_xmit_data_off;
+	unsigned int		c_xmit_rdma_sent;
+
+	spinlock_t		c_lock;		/* protect msg queues */
+	u64			c_next_tx_seq;
+	struct list_head	c_send_queue;
+	struct list_head	c_retrans;
+
+	u64			c_next_rx_seq;
+
+	struct rds_transport	*c_trans;
+	void			*c_transport_data;
+
+	atomic_t		c_state;
+	unsigned long		c_flags;
+	unsigned long		c_reconnect_jiffies;
+	struct delayed_work	c_send_w;
+	struct delayed_work	c_recv_w;
+	struct delayed_work	c_conn_w;
+	struct work_struct	c_down_w;
+	struct mutex		c_cm_lock;	/* protect conn state & cm */
+
+	struct list_head	c_map_item;
+	unsigned long		c_map_queued;
+	unsigned long		c_map_offset;
+	unsigned long		c_map_bytes;
+
+	unsigned int		c_unacked_packets;
+	unsigned int		c_unacked_bytes;
+
+	/* Protocol version */
+	unsigned int		c_version;
+};
+
+#define RDS_FLAG_CONG_BITMAP	0x01
+#define RDS_FLAG_ACK_REQUIRED	0x02
+#define RDS_FLAG_RETRANSMITTED	0x04
+#define RDS_MAX_ADV_CREDIT	255
+
+/*
+ * Maximum space available for extension headers.
+ */
+#define RDS_HEADER_EXT_SPACE	16
+
+struct rds_header {
+	__be64	h_sequence;
+	__be64	h_ack;
+	__be32	h_len;
+	__be16	h_sport;
+	__be16	h_dport;
+	u8	h_flags;
+	u8	h_credit;
+	u8	h_padding[4];
+	__sum16	h_csum;
+
+	u8	h_exthdr[RDS_HEADER_EXT_SPACE];
+};
+
+/*
+ * Reserved - indicates end of extensions
+ */
+#define RDS_EXTHDR_NONE		0
+
+/*
+ * This extension header is included in the very
+ * first message that is sent on a new connection,
+ * and identifies the protocol level. This will help
+ * rolling updates if a future change requires breaking
+ * the protocol.
+ * NB: This is no longer true for IB, where we do a version
+ * negotiation during the connection setup phase (protocol
+ * version information is included in the RDMA CM private data).
+ */
+#define RDS_EXTHDR_VERSION	1
+struct rds_ext_header_version {
+	__be32			h_version;
+};
+
+/*
+ * This extension header is included in the RDS message
+ * chasing an RDMA operation.
+ */
+#define RDS_EXTHDR_RDMA		2
+struct rds_ext_header_rdma {
+	__be32			h_rdma_rkey;
+};
+
+/*
+ * This extension header tells the peer about the
+ * destination <R_Key,offset> of the requested RDMA
+ * operation.
+ */
+#define RDS_EXTHDR_RDMA_DEST	3
+struct rds_ext_header_rdma_dest {
+	__be32			h_rdma_rkey;
+	__be32			h_rdma_offset;
+};
+
+#define __RDS_EXTHDR_MAX	16 /* for now */
+
+struct rds_incoming {
+	atomic_t		i_refcount;
+	struct list_head	i_item;
+	struct rds_connection	*i_conn;
+	struct rds_header	i_hdr;
+	unsigned long		i_rx_jiffies;
+	__be32			i_saddr;
+
+	rds_rdma_cookie_t	i_rdma_cookie;
+};
+
+/*
+ * m_sock_item and m_conn_item are on lists that are serialized under
+ * conn->c_lock.  m_sock_item has additional meaning in that once it is empty
+ * the message will not be put back on the retransmit list after being sent.
+ * Messages that are canceled while being sent rely on this.
+ *
+ * m_inc is used by loopback so that it can pass an incoming message straight
+ * back up into the rx path.  It embeds a wire header which is also used by
+ * the send path, which is kind of awkward.
+ *
+ * m_sock_item indicates the message's presence on a socket's send or receive
+ * queue.  m_rs will point to that socket.
+ *
+ * m_daddr is used by cancellation to prune messages to a given destination.
+ *
+ * The RDS_MSG_ON_SOCK and RDS_MSG_ON_CONN flags are used to avoid lock
+ * nesting.  As paths iterate over messages on a sock, or conn, they must
+ * also lock the conn, or sock, to remove the message from those lists too.
+ * Testing the flag to determine if the message is still on the lists lets
+ * us avoid testing the list_head directly.  That means each path can use
+ * the message's list_head to keep it on a local list while juggling locks
+ * without confusing the other path.
+ *
+ * m_ack_seq is an optional field set by transports who need a different
+ * sequence number range to invalidate.  They can use this in a callback
+ * that they pass to rds_send_drop_acked() to see if each message has been
+ * acked.  The HAS_ACK_SEQ flag can be used to detect messages which haven't
+ * had ack_seq set yet.
+ */
+#define RDS_MSG_ON_SOCK		1
+#define RDS_MSG_ON_CONN		2
+#define RDS_MSG_HAS_ACK_SEQ	3
+#define RDS_MSG_ACK_REQUIRED	4
+#define RDS_MSG_RETRANSMITTED	5
+#define RDS_MSG_MAPPED		6
+#define RDS_MSG_PAGEVEC		7
+
+struct rds_message {
+	atomic_t		m_refcount;
+	struct list_head	m_sock_item;
+	struct list_head	m_conn_item;
+	struct rds_incoming	m_inc;
+	u64			m_ack_seq;
+	__be32			m_daddr;
+	unsigned long		m_flags;
+
+	/* Never access m_rs without holding m_rs_lock.
+	 * Lock nesting is
+	 *  rm->m_rs_lock
+	 *   -> rs->rs_lock
+	 */
+	spinlock_t		m_rs_lock;
+	struct rds_sock		*m_rs;
+	struct rds_rdma_op	*m_rdma_op;
+	rds_rdma_cookie_t	m_rdma_cookie;
+	struct rds_mr		*m_rdma_mr;
+	unsigned int		m_nents;
+	unsigned int		m_count;
+	struct scatterlist	m_sg[0];
+};
+
+/*
+ * The RDS notifier is used (optionally) to tell the application about
+ * completed RDMA operations. Rather than keeping the whole rds message
+ * around on the queue, we allocate a small notifier that is put on the
+ * socket's notifier_list. Notifications are delivered to the application
+ * through control messages.
+ */
+struct rds_notifier {
+	struct list_head	n_list;
+	uint64_t		n_user_token;
+	int			n_status;
+};
+
+/**
+ * struct rds_transport -  transport specific behavioural hooks
+ *
+ * @xmit: .xmit is called by rds_send_xmit() to tell the transport to send
+ *        part of a message.  The caller serializes on the send_sem so this
+ *        doesn't need to be reentrant for a given conn.  The header must be
+ *        sent before the data payload.  .xmit must be prepared to send a
+ *        message with no data payload.  .xmit should return the number of
+ *        bytes that were sent down the connection, including header bytes.
+ *        Returning 0 tells the caller that it doesn't need to perform any
+ *        additional work now.  This is usually the case when the transport has
+ *        filled the sending queue for its connection and will handle
+ *        triggering the rds thread to continue the send when space becomes
+ *        available.  Returning -EAGAIN tells the caller to retry the send
+ *        immediately.  Returning -ENOMEM tells the caller to retry the send at
+ *        some point in the future.
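+ *        In short, the caller interprets the return value roughly as:
+ *          > 0      that many bytes (header + payload) were sent
+ *            0      stop for now; the transport re-arms the send itself
+ *          -EAGAIN  retry the send immediately
+ *          -ENOMEM  retry the send later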
+ *
+ * @conn_shutdown: conn_shutdown stops traffic on the given connection.  Once
+ *                 it returns the connection can not call rds_recv_incoming().
+ *                 This will only be called once after conn_connect returns
+ *                 non-zero success and will The caller serializes this with
+ *                 non-zero success.  The caller serializes this with
+ *                 transport is responsible for other serialization, including
+ *                 rds_recv_incoming().  This is called in process context but
+ *                 should try hard not to block.
+ *
+ * @xmit_cong_map: This asks the transport to send the local bitmap down the
+ * 		   given connection.  XXX get a better story about the bitmap
+ * 		   flag and header.
+ */
+
+struct rds_transport {
+	struct list_head	t_item;
+	struct module		*t_owner;
+	char			*t_name;
+	unsigned int		t_prefer_loopback:1;
+
+	int (*laddr_check)(__be32 addr);
+	int (*conn_alloc)(struct rds_connection *conn, gfp_t gfp);
+	void (*conn_free)(void *data);
+	int (*conn_connect)(struct rds_connection *conn);
+	void (*conn_shutdown)(struct rds_connection *conn);
+	void (*xmit_prepare)(struct rds_connection *conn);
+	void (*xmit_complete)(struct rds_connection *conn);
+	int (*xmit)(struct rds_connection *conn, struct rds_message *rm,
+		    unsigned int hdr_off, unsigned int sg, unsigned int off);
+	int (*xmit_cong_map)(struct rds_connection *conn,
+			     struct rds_cong_map *map, unsigned long offset);
+	int (*xmit_rdma)(struct rds_connection *conn, struct rds_rdma_op *op);
+	int (*recv)(struct rds_connection *conn);
+	int (*inc_copy_to_user)(struct rds_incoming *inc, struct iovec *iov,
+				size_t size);
+	void (*inc_purge)(struct rds_incoming *inc);
+	void (*inc_free)(struct rds_incoming *inc);
+	void (*listen_stop)(void);
+	unsigned int (*stats_info_copy)(struct rds_info_iterator *iter,
+					unsigned int avail);
+	void (*exit)(void);
+	void *(*get_mr)(struct scatterlist *sg, unsigned long nr_sg,
+			struct rds_sock *rs, u32 *key_ret);
+	void (*sync_mr)(void *trans_private, int direction);
+	void (*free_mr)(void *trans_private, int invalidate);
+	void (*flush_mrs)(void);
+};
+
+struct rds_sock {
+#ifdef KERNEL_HAS_PROTO_REGISTER
+	struct sock		rs_sk;
+#endif
+#ifndef KERNEL_HAS_PROTO_REGISTER
+	struct sock		*rs_sk;
+#endif
+
+	u64			rs_user_addr;
+	u64			rs_user_bytes;
+
+	/*
+	 * bound_addr used for both incoming and outgoing, no INADDR_ANY
+	 * support.
+	 */
+	struct rb_node		rs_bound_node;
+	__be32			rs_bound_addr;
+	__be32			rs_conn_addr;
+	__be16			rs_bound_port;
+	__be16			rs_conn_port;
+
+	/*
+	 * This is only used to communicate the transport between bind and
+	 * initiating connections.  All other trans use is referenced through
+	 * the connection.
+	 */
+	struct rds_transport    *rs_transport;
+
+	/*
+	 * rds_sendmsg caches the conn it used the last time around.
+	 * This helps avoid costly lookups.
+	 */
+	struct rds_connection	*rs_conn;
+
+	/* flag indicating we were congested or not */
+	int			rs_congested;
+
+	/* rs_lock protects all these adjacent members before the newline */
+	spinlock_t		rs_lock;
+	struct list_head	rs_send_queue;
+	u32			rs_snd_bytes;
+	int			rs_rcv_bytes;
+	struct list_head	rs_notify_queue;	/* currently used for failed RDMAs */
+
+	/* Congestion wake_up. If rs_cong_monitor is set, we use cong_mask
+	 * to decide whether the application should be woken up.
+	 * If not set, we use rs_cong_track to find out whether a cong map
+	 * update arrived.
+	 */
+	uint64_t		rs_cong_mask;
+	uint64_t		rs_cong_notify;
+	struct list_head	rs_cong_list;
+	unsigned long		rs_cong_track;
+
+	/*
+	 * rs_recv_lock protects the receive queue, and is
+	 * used to serialize with rds_release.
+	 */
+	rwlock_t		rs_recv_lock;
+	struct list_head	rs_recv_queue;
+
+	/* just for stats reporting */
+	struct list_head	rs_item;
+
+	/* these have their own lock */
+	spinlock_t		rs_rdma_lock;
+	struct rb_root		rs_rdma_keys;
+
+	/* Socket options - in case there will be more */
+	unsigned char		rs_recverr,
+				rs_cong_monitor;
+};
+
+#ifdef KERNEL_HAS_PROTO_REGISTER
+static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk)
+{
+	return container_of(sk, struct rds_sock, rs_sk);
+}
+static inline struct sock *rds_rs_to_sk(struct rds_sock *rs)
+{
+	return &rs->rs_sk;
+}
+#endif /* KERNEL_HAS_PROTO_REGISTER */
+#ifndef KERNEL_HAS_PROTO_REGISTER
+static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk)
+{
+	return (struct rds_sock *)sk->sk_protinfo;
+}
+static inline struct sock *rds_rs_to_sk(const struct rds_sock *rs)
+{
+	return rs->rs_sk;
+}
+#endif /* KERNEL_HAS_PROTO_REGISTER */
+
+/*
+ * The stack assigns sk_sndbuf and sk_rcvbuf to twice the specified value
+ * to account for overhead.  We don't account for overhead, we just apply
+ * the number of payload bytes to the specified value.
+ */
+static inline int rds_sk_sndbuf(struct rds_sock *rs)
+{
+	return rds_rs_to_sk(rs)->sk_sndbuf / 2;
+}
+static inline int rds_sk_rcvbuf(struct rds_sock *rs)
+{
+	return rds_rs_to_sk(rs)->sk_rcvbuf / 2;
+}
+
+struct rds_statistics {
+	uint64_t	s_conn_reset;
+	uint64_t	s_recv_drop_bad_checksum;
+	uint64_t	s_recv_drop_old_seq;
+	uint64_t	s_recv_drop_no_sock;
+	uint64_t	s_recv_drop_dead_sock;
+	uint64_t	s_recv_deliver_raced;
+	uint64_t	s_recv_delivered;
+	uint64_t	s_recv_queued;
+	uint64_t	s_recv_immediate_retry;
+	uint64_t	s_recv_delayed_retry;
+	uint64_t	s_recv_ack_required;
+	uint64_t	s_recv_rdma_bytes;
+	uint64_t	s_recv_ping;
+	uint64_t	s_send_queue_empty;
+	uint64_t	s_send_queue_full;
+	uint64_t	s_send_sem_contention;
+	uint64_t	s_send_sem_queue_raced;
+	uint64_t	s_send_immediate_retry;
+	uint64_t	s_send_delayed_retry;
+	uint64_t	s_send_drop_acked;
+	uint64_t	s_send_ack_required;
+	uint64_t	s_send_queued;
+	uint64_t	s_send_rdma;
+	uint64_t	s_send_rdma_bytes;
+	uint64_t	s_send_pong;
+	uint64_t	s_page_remainder_hit;
+	uint64_t	s_page_remainder_miss;
+	uint64_t	s_copy_to_user;
+	uint64_t	s_copy_from_user;
+	uint64_t	s_cong_update_queued;
+	uint64_t	s_cong_update_received;
+	uint64_t	s_cong_send_error;
+	uint64_t	s_cong_send_blocked;
+};
+
+/* af_rds.c */
+void rds_sock_addref(struct rds_sock *rs);
+void rds_sock_put(struct rds_sock *rs);
+void rds_wake_sk_sleep(struct rds_sock *rs);
+static inline void __rds_wake_sk_sleep(struct sock *sk)
+{
+	wait_queue_head_t *waitq = sk->sk_sleep;
+
+	if (!sock_flag(sk, SOCK_DEAD) && waitq)
+		wake_up(waitq);
+}
+extern wait_queue_head_t rds_poll_waitq;
+
+
+/* bind.c */
+int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
+void rds_remove_bound(struct rds_sock *rs);
+struct rds_sock *rds_find_bound(__be32 addr, __be16 port);
+
+/* cong.c */
+int rds_cong_get_maps(struct rds_connection *conn);
+void rds_cong_add_conn(struct rds_connection *conn);
+void rds_cong_remove_conn(struct rds_connection *conn);
+void rds_cong_set_bit(struct rds_cong_map *map, __be16 port);
+void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port);
+int rds_cong_wait(struct rds_cong_map *map, __be16 port, int nonblock, struct rds_sock *rs);
+void rds_cong_queue_updates(struct rds_cong_map *map);
+void rds_cong_map_updated(struct rds_cong_map *map, uint64_t);
+int rds_cong_updated_since(unsigned long *recent);
+void rds_cong_add_socket(struct rds_sock *);
+void rds_cong_remove_socket(struct rds_sock *);
+void rds_cong_exit(void);
+struct rds_message *rds_cong_update_alloc(struct rds_connection *conn);
+
+/* conn.c */
+int __init rds_conn_init(void);
+void rds_conn_exit(void);
+struct rds_connection *rds_conn_create(__be32 laddr, __be32 faddr,
+				       struct rds_transport *trans, gfp_t gfp);
+struct rds_connection *rds_conn_create_outgoing(__be32 laddr, __be32 faddr,
+			       struct rds_transport *trans, gfp_t gfp);
+void rds_conn_destroy(struct rds_connection *conn);
+void rds_conn_reset(struct rds_connection *conn);
+void rds_conn_drop(struct rds_connection *conn);
+void rds_for_each_conn_info(struct socket *sock, unsigned int len,
+			  struct rds_info_iterator *iter,
+			  struct rds_info_lengths *lens,
+			  int (*visitor)(struct rds_connection *, void *),
+			  size_t item_len);
+void __rds_conn_error(struct rds_connection *conn, const char *, ...)
+				__attribute__ ((format (printf, 2, 3)));
+#define rds_conn_error(conn, fmt...) \
+	__rds_conn_error(conn, KERN_WARNING "RDS: " fmt)
+
+static inline int
+rds_conn_transition(struct rds_connection *conn, int old, int new)
+{
+	return atomic_cmpxchg(&conn->c_state, old, new) == old;
+}
+
+static inline int
+rds_conn_state(struct rds_connection *conn)
+{
+	return atomic_read(&conn->c_state);
+}
+
+static inline int
+rds_conn_up(struct rds_connection *conn)
+{
+	return atomic_read(&conn->c_state) == RDS_CONN_UP;
+}
+
+/* message.c */
+struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp);
+struct rds_message *rds_message_copy_from_user(struct iovec *first_iov,
+					       size_t total_len);
+struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len);
+void rds_message_populate_header(struct rds_header *hdr, __be16 sport,
+				 __be16 dport, u64 seq);
+int rds_message_add_extension(struct rds_header *hdr,
+			      unsigned int type, const void *data, unsigned int len);
+int rds_message_next_extension(struct rds_header *hdr,
+			       unsigned int *pos, void *buf, unsigned int *buflen);
+int rds_message_add_version_extension(struct rds_header *hdr, unsigned int version);
+int rds_message_get_version_extension(struct rds_header *hdr, unsigned int *version);
+int rds_message_add_rdma_dest_extension(struct rds_header *hdr, u32 r_key, u32 offset);
+int rds_message_inc_copy_to_user(struct rds_incoming *inc,
+				 struct iovec *first_iov, size_t size);
+void rds_message_inc_purge(struct rds_incoming *inc);
+void rds_message_inc_free(struct rds_incoming *inc);
+void rds_message_addref(struct rds_message *rm);
+void rds_message_put(struct rds_message *rm);
+void rds_message_wait(struct rds_message *rm);
+void rds_message_unmapped(struct rds_message *rm);
+
+static inline void rds_message_make_checksum(struct rds_header *hdr)
+{
+	hdr->h_csum = 0;
+	hdr->h_csum = ip_fast_csum((void *) hdr, sizeof(*hdr) >> 2);
+}
+
+static inline int rds_message_verify_checksum(const struct rds_header *hdr)
+{
+	return !hdr->h_csum || ip_fast_csum((void *) hdr, sizeof(*hdr) >> 2) == 0;
+}
+
+
+/* page.c */
+int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes,
+			     gfp_t gfp);
+int rds_page_copy_user(struct page *page, unsigned long offset,
+		       void __user *ptr, unsigned long bytes,
+		       int to_user);
+#define rds_page_copy_to_user(page, offset, ptr, bytes) \
+	rds_page_copy_user(page, offset, ptr, bytes, 1)
+#define rds_page_copy_from_user(page, offset, ptr, bytes) \
+	rds_page_copy_user(page, offset, ptr, bytes, 0)
+void rds_page_exit(void);
+
+/* recv.c */
+void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
+		  __be32 saddr);
+void rds_inc_addref(struct rds_incoming *inc);
+void rds_inc_put(struct rds_incoming *inc);
+void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
+		       struct rds_incoming *inc, gfp_t gfp, enum km_type km);
+int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		size_t size, int msg_flags);
+void rds_clear_recv_queue(struct rds_sock *rs);
+int rds_notify_queue_get(struct rds_sock *rs, struct msghdr *msg);
+void rds_inc_info_copy(struct rds_incoming *inc,
+		       struct rds_info_iterator *iter,
+		       __be32 saddr, __be32 daddr, int flip);
+
+/* send.c */
+int rds_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		size_t payload_len);
+void rds_send_reset(struct rds_connection *conn);
+int rds_send_xmit(struct rds_connection *conn);
+struct sockaddr_in;
+void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest);
+typedef int (*is_acked_func)(struct rds_message *rm, uint64_t ack);
+void rds_send_drop_acked(struct rds_connection *conn, u64 ack,
+			 is_acked_func is_acked);
+int rds_send_acked_before(struct rds_connection *conn, u64 seq);
+void rds_send_remove_from_sock(struct list_head *messages, int status);
+int rds_send_pong(struct rds_connection *conn, __be16 dport);
+struct rds_message *rds_send_get_message(struct rds_connection *,
+					 struct rds_rdma_op *);
+
+/* rdma.c */
+void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force);
+
+/* stats.c */
+DECLARE_PER_CPU(struct rds_statistics, rds_stats);
+#define rds_stats_inc_which(which, member) do {		\
+	per_cpu(which, get_cpu()).member++;		\
+	put_cpu();					\
+} while (0)
+#define rds_stats_inc(member) rds_stats_inc_which(rds_stats, member)
+#define rds_stats_add_which(which, member, count) do {		\
+	per_cpu(which, get_cpu()).member += count;	\
+	put_cpu();					\
+} while (0)
+#define rds_stats_add(member, count) rds_stats_add_which(rds_stats, member, count)
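+/* e.g. rds_stats_inc(s_send_queued); rds_stats_add(s_copy_to_user, nbytes); */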
+int __init rds_stats_init(void);
+void rds_stats_exit(void);
+void rds_stats_info_copy(struct rds_info_iterator *iter,
+			 uint64_t *values, char **names, size_t nr);
+
+/* sysctl.c */
+int __init rds_sysctl_init(void);
+void rds_sysctl_exit(void);
+extern unsigned long rds_sysctl_sndbuf_min;
+extern unsigned long rds_sysctl_sndbuf_default;
+extern unsigned long rds_sysctl_sndbuf_max;
+extern unsigned long rds_sysctl_reconnect_min_jiffies;
+extern unsigned long rds_sysctl_reconnect_max_jiffies;
+extern unsigned int  rds_sysctl_max_unacked_packets;
+extern unsigned int  rds_sysctl_max_unacked_bytes;
+extern unsigned int  rds_sysctl_ping_enable;
+extern unsigned long rds_sysctl_trace_flags;
+extern unsigned int  rds_sysctl_trace_level;
+
+/* threads.c */
+int __init rds_threads_init(void);
+void rds_threads_exit(void);
+extern struct workqueue_struct *rds_wq;
+void rds_connect_worker(struct work_struct *);
+void rds_shutdown_worker(struct work_struct *);
+void rds_send_worker(struct work_struct *);
+void rds_recv_worker(struct work_struct *);
+void rds_connect_complete(struct rds_connection *conn);
+
+/* transport.c */
+int rds_trans_register(struct rds_transport *trans);
+void rds_trans_unregister(struct rds_transport *trans);
+struct rds_transport *rds_trans_get_preferred(__be32 addr);
+void rds_trans_stop_listening(void);
+unsigned int rds_trans_stats_info_copy(struct rds_info_iterator *iter,
+				       unsigned int avail);
+int __init rds_trans_init(void);
+void rds_trans_exit(void);
+
+/* ib.c */
+void rds_ib_exit(void);
+int rds_ib_init(void);
+
+#endif
-- 
1.5.6.3


* [PATCH 03/21] RDS: Congestion-handling code
From: Andy Grover @ 2009-01-27  2:17 UTC
  To: rdreier; +Cc: rds-devel, general, netdev

RDS handles per-socket congestion by updating peers with a complete
congestion map (8KB). This code keeps track of the local map and of the
maps received from peers.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/cong.c |  424 +++++++++++++++++++++++++++++++++++++
 1 files changed, 424 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/cong.c

diff --git a/drivers/infiniband/ulp/rds/cong.c b/drivers/infiniband/ulp/rds/cong.c
new file mode 100644
index 0000000..b7c49d2
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/cong.c
@@ -0,0 +1,424 @@
+/*
+ * Copyright (c) 2007 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/types.h>
+#include <linux/rbtree.h>
+
+#include "rds.h"
+
+/*
+ * This file implements the receive side of the unconventional congestion
+ * management in RDS.
+ *
+ * Messages waiting in the receive queue on the receiving socket are accounted
+ * against the socket's SO_RCVBUF option value.  Only the payload bytes in the
+ * message are accounted for.  If the number of bytes queued equals or exceeds
+ * rcvbuf then the socket is congested.  All sends attempted to this socket's
+ * address should block or return -EWOULDBLOCK.
+ *
+ * Applications are expected to be reasonably tuned such that this situation
+ * very rarely occurs.  An application encountering this "back-pressure" is
+ * considered a bug.
+ *
+ * This is implemented by having each node maintain bitmaps which indicate
+ * which ports on bound addresses are congested.  As the bitmap changes it is
+ * sent through all the connections which terminate in the local address of the
+ * bitmap which changed.
+ *
+ * The bitmaps are allocated as connections are brought up.  This avoids
+ * allocation in the interrupt handling path which queues messages on sockets.
+ * The dense bitmaps let transports send the entire bitmap on any bitmap change
+ * reasonably efficiently.  This is much easier to implement than some
+ * finer-grained communication of per-port congestion.  The sender does a very
+ * inexpensive bit test to test if the port it's about to send to is congested
+ * or not.
+ */
+
+/*
+ * Interaction with poll is a tad tricky. We want all processes stuck in
+ * poll to wake up and check whether a congested destination became uncongested.
+ * The really sad thing is we have no idea which destinations the application
+ * wants to send to - we don't even know which rds_connections are involved.
+ * So until we implement a more flexible rds poll interface, we have to make
+ * do with this:
+ * We maintain a global counter that is incremented each time a congestion map
+ * update is received. Each rds socket tracks this value, and if rds_poll
+ * finds that the saved generation number is smaller than the global generation
+ * number, it wakes up the process.
+ */
+static atomic_t		rds_cong_generation = ATOMIC_INIT(0);
+
+/*
+ * Congestion monitoring
+ */
+static LIST_HEAD(rds_cong_monitor);
+static DEFINE_RWLOCK(rds_cong_monitor_lock);
+
+/*
+ * Yes, a global lock.  It's used so infrequently that it's worth keeping it
+ * global to simplify the locking.  It's only used in the following
+ * circumstances:
+ *
+ *  - on connection buildup to associate a conn with its maps
+ *  - on map changes to inform conns of a new map to send
+ *
+ *  It's sadly ordered under the socket callback lock and the connection lock.
+ *  Receive paths can mark ports congested from interrupt context so the
+ *  lock masks interrupts.
+ */
+static DEFINE_SPINLOCK(rds_cong_lock);
+static struct rb_root rds_cong_tree = RB_ROOT;
+
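+/*
+ * Look up the map for @addr in the global tree.  If @insert is non-NULL
+ * and no map exists yet, @insert is linked in at the point where the
+ * search ended.  Returns the existing map, or NULL if there was none
+ * (in which case @insert, if given, is now in the tree).
+ */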
+static struct rds_cong_map *rds_cong_tree_walk(__be32 addr,
+					       struct rds_cong_map *insert)
+{
+	struct rb_node **p = &rds_cong_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct rds_cong_map *map;
+
+	while (*p) {
+		parent = *p;
+		map = rb_entry(parent, struct rds_cong_map, m_rb_node);
+
+		if (addr < map->m_addr)
+			p = &(*p)->rb_left;
+		else if (addr > map->m_addr)
+			p = &(*p)->rb_right;
+		else
+			return map;
+	}
+
+	if (insert) {
+		rb_link_node(&insert->m_rb_node, parent, p);
+		rb_insert_color(&insert->m_rb_node, &rds_cong_tree);
+	}
+	return NULL;
+}
+
+/*
+ * There is only ever one bitmap for any address.  Connections try to
+ * allocate these bitmaps in the process of getting pointers to them.  The
+ * bitmaps are only ever freed as the module is removed after all connections
+ * have been freed.
+ */
+static struct rds_cong_map *rds_cong_from_addr(__be32 addr)
+{
+	struct rds_cong_map *map;
+	struct rds_cong_map *ret = NULL;
+	unsigned long zp;
+	unsigned long i;
+	unsigned long flags;
+
+	map = kzalloc(sizeof(struct rds_cong_map), GFP_KERNEL);
+	if (map == NULL)
+		return NULL;
+
+	map->m_addr = addr;
+	init_waitqueue_head(&map->m_waitq);
+	INIT_LIST_HEAD(&map->m_conn_list);
+
+	for (i = 0; i < RDS_CONG_MAP_PAGES; i++) {
+		zp = get_zeroed_page(GFP_KERNEL);
+		if (zp == 0)
+			goto out;
+		map->m_page_addrs[i] = zp;
+	}
+
+	spin_lock_irqsave(&rds_cong_lock, flags);
+	ret = rds_cong_tree_walk(addr, map);
+	spin_unlock_irqrestore(&rds_cong_lock, flags);
+
+	if (ret == NULL) {
+		ret = map;
+		map = NULL;
+	}
+
+out:
+	if (map) {
+		for (i = 0; i < RDS_CONG_MAP_PAGES && map->m_page_addrs[i]; i++)
+			free_page(map->m_page_addrs[i]);
+		kfree(map);
+	}
+
+	rdsdebug("map %p for addr %x\n", ret, be32_to_cpu(addr));
+
+	return ret;
+}
+
+/*
+ * Put the conn on its local map's list.  This is called when the conn is
+ * really added to the hash.  It's nested under the rds_conn_lock, sadly.
+ */
+void rds_cong_add_conn(struct rds_connection *conn)
+{
+	unsigned long flags;
+
+	rdsdebug("conn %p now on map %p\n", conn, conn->c_lcong);
+	spin_lock_irqsave(&rds_cong_lock, flags);
+	list_add_tail(&conn->c_map_item, &conn->c_lcong->m_conn_list);
+	spin_unlock_irqrestore(&rds_cong_lock, flags);
+}
+
+void rds_cong_remove_conn(struct rds_connection *conn)
+{
+	unsigned long flags;
+
+	rdsdebug("removing conn %p from map %p\n", conn, conn->c_lcong);
+	spin_lock_irqsave(&rds_cong_lock, flags);
+	list_del_init(&conn->c_map_item);
+	spin_unlock_irqrestore(&rds_cong_lock, flags);
+}
+
+int rds_cong_get_maps(struct rds_connection *conn)
+{
+	conn->c_lcong = rds_cong_from_addr(conn->c_laddr);
+	conn->c_fcong = rds_cong_from_addr(conn->c_faddr);
+
+	if (conn->c_lcong == NULL || conn->c_fcong == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void rds_cong_queue_updates(struct rds_cong_map *map)
+{
+	struct rds_connection *conn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&rds_cong_lock, flags);
+
+	list_for_each_entry(conn, &map->m_conn_list, c_map_item) {
+		if (!test_and_set_bit(0, &conn->c_map_queued)) {
+			rds_stats_inc(s_cong_update_queued);
+			queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+		}
+	}
+
+	spin_unlock_irqrestore(&rds_cong_lock, flags);
+}
+
+void rds_cong_map_updated(struct rds_cong_map *map, uint64_t portmask)
+{
+	rdsdebug("waking map %p\n", map);
+	rdstrace(RDS_CONG, RDS_LOW, "waking map %p for %u.%u.%u.%u\n",
+	  map, NIPQUAD(map->m_addr));
+	rds_stats_inc(s_cong_update_received);
+	atomic_inc(&rds_cong_generation);
+	if (waitqueue_active(&map->m_waitq))
+		wake_up(&map->m_waitq);
+	if (waitqueue_active(&rds_poll_waitq))
+		wake_up_all(&rds_poll_waitq);
+
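+	/* Tell monitoring sockets when a port they armed a notification
+	 * for (see rds_cong_wait()) has been updated. */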
+	if (portmask && !list_empty(&rds_cong_monitor)) {
+		unsigned long flags;
+		struct rds_sock *rs;
+
+		read_lock_irqsave(&rds_cong_monitor_lock, flags);
+		list_for_each_entry(rs, &rds_cong_monitor, rs_cong_list) {
+			spin_lock(&rs->rs_lock);
+			rs->rs_cong_notify |= (rs->rs_cong_mask & portmask);
+			rs->rs_cong_mask &= ~portmask;
+			spin_unlock(&rs->rs_lock);
+			if (rs->rs_cong_notify)
+				rds_wake_sk_sleep(rs);
+		}
+		read_unlock_irqrestore(&rds_cong_monitor_lock, flags);
+	}
+}
+EXPORT_SYMBOL_GPL(rds_cong_map_updated);
+
+int rds_cong_updated_since(unsigned long *recent)
+{
+	unsigned long gen = atomic_read(&rds_cong_generation);
+
+	if (likely(*recent == gen))
+		return 0;
+	*recent = gen;
+	return 1;
+}
+
+/*
+ * These should be using generic_{test,__{clear,set}}_le_bit() but some old
+ * kernels don't have them.  Sigh.
+ */
+#if defined(__BIG_ENDIAN)
+# define LE_BIT_XOR        ((BITS_PER_LONG-1) & ~0x7)
+#else
+# if !defined(__LITTLE_ENDIAN)
+#  error "asm/byteorder.h didn't define __BIG or __LITTLE _ENDIAN ?"
+# endif
+# define LE_BIT_XOR        0
+#endif
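+/*
+ * Example: with 64-bit big-endian longs LE_BIT_XOR is 56, so little-endian
+ * bit 0 maps to bit 56 of the first long -- bit 0 of byte 0, the position
+ * the on-the-wire little-endian format expects.
+ */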
+
+/*
+ * We're called under the locking that protects the socket's receive buffer
+ * consumption.  This makes it a lot easier for the caller to only call us
+ * when it knows that an existing set bit needs to be cleared, and vice versa.
+ * We can't block and we need to deal with concurrent sockets working against
+ * the same per-address map.
+ */
+void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
+{
+	unsigned long i;
+	unsigned long off;
+
+	rdsdebug("setting port %u on map %p\n", be16_to_cpu(port), map);
+	rdstrace(RDS_CONG, RDS_LOW,
+	  "setting congestion for %u.%u.%u.%u:%u in map %p\n",
+	  NIPQUAD(map->m_addr), ntohs(port), map);
+
+	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
+	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
+
+	set_bit(off ^ LE_BIT_XOR, (void *)map->m_page_addrs[i]);
+}
+
+void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
+{
+	unsigned long i;
+	unsigned long off;
+
+	rdsdebug("clearing port %u on map %p\n", be16_to_cpu(port), map);
+	rdstrace(RDS_CONG, RDS_LOW,
+	  "clearing congestion for %u.%u.%u.%u:%u in map %p\n",
+	  NIPQUAD(map->m_addr), ntohs(port), map);
+
+	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
+	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
+
+	clear_bit(off ^ LE_BIT_XOR, (void *)map->m_page_addrs[i]);
+}
+
+static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port)
+{
+	unsigned long i;
+	unsigned long off;
+
+	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
+	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
+
+	return test_bit(off ^ LE_BIT_XOR, (void *)map->m_page_addrs[i]);
+}
+
+#undef LE_BIT_XOR
+
+void rds_cong_add_socket(struct rds_sock *rs)
+{
+	unsigned long flags;
+
+	write_lock_irqsave(&rds_cong_monitor_lock, flags);
+	if (list_empty(&rs->rs_cong_list))
+		list_add(&rs->rs_cong_list, &rds_cong_monitor);
+	write_unlock_irqrestore(&rds_cong_monitor_lock, flags);
+}
+
+void rds_cong_remove_socket(struct rds_sock *rs)
+{
+	unsigned long flags;
+	struct rds_cong_map *map;
+
+	write_lock_irqsave(&rds_cong_monitor_lock, flags);
+	list_del_init(&rs->rs_cong_list);
+	write_unlock_irqrestore(&rds_cong_monitor_lock, flags);
+
+	/* update congestion map for now-closed port */
+	spin_lock_irqsave(&rds_cong_lock, flags);
+	map = rds_cong_tree_walk(rs->rs_bound_addr, NULL);
+	spin_unlock_irqrestore(&rds_cong_lock, flags);
+
+	if (map && rds_cong_test_bit(map, rs->rs_bound_port)) {
+		rds_cong_clear_bit(map, rs->rs_bound_port);
+		rds_cong_queue_updates(map);
+	}
+}
+
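+/*
+ * Returns 0 if @port is not (or no longer) congested.  In the nonblocking
+ * case a congested port fails with -ENOBUFS, after first arming the
+ * socket's congestion-monitor mask if monitoring is enabled; otherwise we
+ * sleep on the map's waitqueue until the bit clears.
+ */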
+int rds_cong_wait(struct rds_cong_map *map, __be16 port, int nonblock,
+		  struct rds_sock *rs)
+{
+	if (!rds_cong_test_bit(map, port))
+		return 0;
+	if (nonblock) {
+		if (rs && rs->rs_cong_monitor) {
+			unsigned long flags;
+
+			/* It would have been nice to have an atomic set_bit on
+			 * a uint64_t. */
+			spin_lock_irqsave(&rs->rs_lock, flags);
+			rs->rs_cong_mask |= RDS_CONG_MONITOR_MASK(ntohs(port));
+			spin_unlock_irqrestore(&rs->rs_lock, flags);
+
+			/* Test again - a congestion update may have arrived in
+			 * the meantime. */
+			if (!rds_cong_test_bit(map, port))
+				return 0;
+		}
+		rds_stats_inc(s_cong_send_error);
+		return -ENOBUFS;
+	}
+
+	rds_stats_inc(s_cong_send_blocked);
+	rdsdebug("waiting on map %p for port %u\n", map, be16_to_cpu(port));
+
+	return wait_event_interruptible(map->m_waitq,
+					!rds_cong_test_bit(map, port));
+}
+
+void rds_cong_exit(void)
+{
+	struct rb_node *node;
+	struct rds_cong_map *map;
+	unsigned long i;
+
+	while ((node = rb_first(&rds_cong_tree))) {
+		map = rb_entry(node, struct rds_cong_map, m_rb_node);
+		rdsdebug("freeing map %p\n", map);
+		rb_erase(&map->m_rb_node, &rds_cong_tree);
+		for (i = 0; i < RDS_CONG_MAP_PAGES && map->m_page_addrs[i]; i++)
+			free_page(map->m_page_addrs[i]);
+		kfree(map);
+	}
+}
+
+/*
+ * Allocate a RDS message containing a congestion update.
+ */
+struct rds_message *rds_cong_update_alloc(struct rds_connection *conn)
+{
+	struct rds_cong_map *map = conn->c_lcong;
+	struct rds_message *rm;
+
+	rm = rds_message_map_pages(map->m_page_addrs, RDS_CONG_MAP_BYTES);
+	if (!IS_ERR(rm))
+		rm->m_inc.i_hdr.h_flags = RDS_FLAG_CONG_BITMAP;
+
+	return rm;
+}
-- 
1.5.6.3



* [PATCH 04/21] RDS: Transport code
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (2 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 03/21] RDS: Congestion-handling code Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27 13:18   ` Evgeniy Polyakov
  2009-01-27  2:17 ` [ofa-general] [PATCH 05/21] RDS: Info and stats Andy Grover
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

RDS supports multiple transports. While this initial submission
only supports the Infiniband transport, the abstraction allows others
to be added. We're working on an iWARP transport, and also see
UDP over DCB as another possibility.

This code handles transport registration.
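
As a rough sketch of how a transport plugs in (the struct fields are
the ones used later in this series; "dummy" and rds_dummy_laddr_check
are made up for illustration):

    static struct rds_transport rds_dummy_transport = {
            .t_name      = "dummy",
            /* return 0 if addr is reachable via this transport */
            .laddr_check = rds_dummy_laddr_check,
            /* plus conn_alloc/conn_free, conn_connect, conn_shutdown,
             * recv, listen_stop, stats_info_copy, ... */
    };

    rds_trans_register(&rds_dummy_transport);   /* module init */
    rds_trans_unregister(&rds_dummy_transport); /* module exit */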

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/transport.c |  134 ++++++++++++++++++++++++++++++++
 1 files changed, 134 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/transport.c

diff --git a/drivers/infiniband/ulp/rds/transport.c b/drivers/infiniband/ulp/rds/transport.c
new file mode 100644
index 0000000..e78f8b3
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/transport.c
@@ -0,0 +1,134 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/in.h>
+
+#include "rds.h"
+#include "loop.h"
+
+static LIST_HEAD(transports);
+static DECLARE_RWSEM(trans_sem);
+
+int rds_trans_register(struct rds_transport *trans)
+{
+	BUG_ON(strlen(trans->t_name) + 1 >
+	       sizeof(((struct rds_info_connection *)0)->transport));
+
+	down_write(&trans_sem);
+
+	list_add_tail(&trans->t_item, &transports);
+	printk(KERN_INFO "Registered RDS/%s transport\n", trans->t_name);
+
+	up_write(&trans_sem);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(rds_trans_register);
+
+void rds_trans_unregister(struct rds_transport *trans)
+{
+	down_write(&trans_sem);
+
+	list_del_init(&trans->t_item);
+	printk(KERN_INFO "Unregistered RDS/%s transport\n", trans->t_name);
+
+	up_write(&trans_sem);
+}
+EXPORT_SYMBOL_GPL(rds_trans_unregister);
+
+struct rds_transport *rds_trans_get_preferred(__be32 addr)
+{
+	struct rds_transport *trans;
+	struct rds_transport *ret = NULL;
+
+	if (IN_LOOPBACK(ntohl(addr)))
+		return &rds_loop_transport;
+
+	down_read(&trans_sem);
+	list_for_each_entry(trans, &transports, t_item) {
+		if (trans->laddr_check(addr) == 0) {
+			ret = trans;
+			break;
+		}
+	}
+	up_read(&trans_sem);
+
+	return ret;
+}
+
+/*
+ * TODO remove this, only called in rds_exit, by which point we know
+ * all the transports will have been unloaded
+ */
+void rds_trans_stop_listening(void)
+{
+	struct rds_transport *trans;
+
+	down_read(&trans_sem);
+
+	list_for_each_entry(trans, &transports, t_item)
+		trans->listen_stop();
+
+	up_read(&trans_sem);
+}
+
+/*
+ * This returns the number of stats entries in the snapshot and only
+ * copies them using the iter if there is enough space for them.  The
+ * caller passes in the global stats so that we can size and copy while
+ * holding the lock.
+ */
+unsigned int rds_trans_stats_info_copy(struct rds_info_iterator *iter,
+				       unsigned int avail)
+{
+	struct rds_transport *trans;
+	unsigned int total = 0;
+	unsigned int part;
+
+	rds_info_iter_unmap(iter);
+	down_read(&trans_sem);
+
+	list_for_each_entry(trans, &transports, t_item) {
+		if (trans->stats_info_copy == NULL)
+			continue;
+
+		part = trans->stats_info_copy(iter, avail);
+		avail -= min(avail, part);
+		total += part;
+	}
+
+	up_read(&trans_sem);
+
+	return total;
+}
-- 
1.5.6.3



* [ofa-general] [PATCH 05/21] RDS: Info and stats
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (3 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 04/21] RDS: Transport code Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27 13:28   ` Evgeniy Polyakov
  2009-01-27  2:17 ` [PATCH 06/21] RDS: Connection handling Andy Grover
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: netdev, rds-devel, general

RDS exports a number of statistics and counters that are accessible
via the rds-info utility. This code implements the kernel side of
that interface.
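
For example, a tool could fetch the counters roughly like this (a
sketch only; it assumes an RDS socket fd and the RDS socket-option
level, called SOL_RDS here, from the shared header):

    socklen_t len = 0;
    void *buf;
    int each;

    /* a zero-length call fails with ENOSPC but writes the
     * snapshot size back through optlen */
    getsockopt(fd, SOL_RDS, RDS_INFO_COUNTERS, NULL, &len);
    buf = malloc(len);
    /* on success the (positive) return is the size of each entry */
    each = getsockopt(fd, SOL_RDS, RDS_INFO_COUNTERS, buf, &len);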

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/info.c  |  243 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/info.h  |   43 +++++++
 drivers/infiniband/ulp/rds/stats.c |  150 ++++++++++++++++++++++
 3 files changed, 436 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/info.c
 create mode 100644 drivers/infiniband/ulp/rds/info.h
 create mode 100644 drivers/infiniband/ulp/rds/stats.c

diff --git a/drivers/infiniband/ulp/rds/info.c b/drivers/infiniband/ulp/rds/info.c
new file mode 100644
index 0000000..ff3ba1c
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/info.c
@@ -0,0 +1,243 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/percpu.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+#include "rds.h"
+
+/*
+ * This file implements a getsockopt() call which copies a set of fixed
+ * sized structs into a user-specified buffer as a means of providing
+ * read-only information about RDS.
+ *
+ * For a given information source there are a given number of fixed sized
+ * structs at a given time.  The structs are only copied if the user-specified
+ * buffer is big enough.  The destination pages that make up the buffer
+ * are pinned for the duration of the copy.
+ *
+ * This gives us the following benefits:
+ *
+ * - simple implementation, no copy "position" across multiple calls
+ * - consistent snapshot of an info source
+ * - atomic copy works well with whatever locking info source has
+ * - one portable tool to get rds info across implementations
+ * - long-lived tool can get info without allocating
+ *
+ * at the following costs:
+ *
+ * - info source copy must be pinned, may be "large"
+ */
+
+struct rds_info_iterator {
+	struct page **pages;
+	void *addr;
+	unsigned long offset;
+};
+
+static DEFINE_SPINLOCK(rds_info_lock);
+static rds_info_func rds_info_funcs[RDS_INFO_LAST - RDS_INFO_FIRST + 1];
+
+void rds_info_register_func(int optname, rds_info_func func)
+{
+	int offset = optname - RDS_INFO_FIRST;
+
+	BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST);
+
+	spin_lock(&rds_info_lock);
+	BUG_ON(rds_info_funcs[offset] != NULL);
+	rds_info_funcs[offset] = func;
+	spin_unlock(&rds_info_lock);
+}
+EXPORT_SYMBOL_GPL(rds_info_register_func);
+
+void rds_info_deregister_func(int optname, rds_info_func func)
+{
+	int offset = optname - RDS_INFO_FIRST;
+
+	BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST);
+
+	spin_lock(&rds_info_lock);
+	BUG_ON(rds_info_funcs[offset] != func);
+	rds_info_funcs[offset] = NULL;
+	spin_unlock(&rds_info_lock);
+}
+EXPORT_SYMBOL_GPL(rds_info_deregister_func);
+
+/*
+ * Typically we hold an atomic kmap across multiple rds_info_copy() calls
+ * because the kmap is so expensive.  This must be called before using blocking
+ * operations while holding the mapping and as the iterator is torn down.
+ */
+void rds_info_iter_unmap(struct rds_info_iterator *iter)
+{
+	if (iter->addr != NULL) {
+		kunmap_atomic(iter->addr, KM_USER0);
+		iter->addr = NULL;
+	}
+}
+
+/*
+ * get_user_pages() called flush_dcache_page() on the pages for us.
+ */
+void rds_info_copy(struct rds_info_iterator *iter, void *data,
+		   unsigned long bytes)
+{
+	unsigned long this;
+
+	while (bytes) {
+		if (iter->addr == NULL)
+			iter->addr = kmap_atomic(*iter->pages, KM_USER0);
+
+		this = min(bytes, PAGE_SIZE - iter->offset);
+
+		rdsdebug("page %p addr %p offset %lu this %lu data %p "
+			  "bytes %lu\n", *iter->pages, iter->addr,
+			  iter->offset, this, data, bytes);
+
+		memcpy(iter->addr + iter->offset, data, this);
+
+		data += this;
+		bytes -= this;
+		iter->offset += this;
+
+		if (iter->offset == PAGE_SIZE) {
+			kunmap_atomic(iter->addr, KM_USER0);
+			iter->addr = NULL;
+			iter->offset = 0;
+			iter->pages++;
+		}
+	}
+}
+
+/*
+ * @optval points to the userspace buffer that the information snapshot
+ * will be copied into.
+ *
+ * @optlen on input is the size of the buffer in userspace.  @optlen
+ * on output is the size of the requested snapshot in bytes.
+ *
+ * This function returns -errno if there is a failure, particularly -ENOSPC
+ * if the given userspace buffer was not large enough to fit the snapshot.
+ * On success it returns the positive number of bytes of each array element
+ * in the snapshot.
+ */
+int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval,
+			int __user *optlen)
+{
+	struct rds_info_iterator iter;
+	struct rds_info_lengths lens;
+	unsigned long nr_pages = 0;
+	unsigned long start;
+	unsigned long i;
+	rds_info_func func;
+	struct page **pages = NULL;
+	int ret;
+	int len;
+	int total;
+
+	if (get_user(len, optlen)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* check for all kinds of wrapping and the like */
+	start = (unsigned long)optval;
+	if (len < 0 || len + PAGE_SIZE - 1 < len || start + len < start) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* a 0 len call is just trying to probe its length */
+	if (len == 0)
+		goto call_func;
+
+	nr_pages = (PAGE_ALIGN(start + len) - (start & PAGE_MASK))
+			>> PAGE_SHIFT;
+
+	pages = kmalloc(nr_pages * sizeof(struct page *), GFP_KERNEL);
+	if (pages == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	down_read(&current->mm->mmap_sem);
+	ret = get_user_pages(current, current->mm, start, nr_pages, 1, 0,
+			     pages, NULL);
+	up_read(&current->mm->mmap_sem);
+	if (ret != nr_pages) {
+		if (ret > 0)
+			nr_pages = ret;
+		else
+			nr_pages = 0;
+		ret = -EAGAIN; /* XXX ? */
+		goto out;
+	}
+
+	rdsdebug("len %d nr_pages %lu\n", len, nr_pages);
+
+call_func:
+	func = rds_info_funcs[optname - RDS_INFO_FIRST];
+	if (func == NULL) {
+		ret = -ENOPROTOOPT;
+		goto out;
+	}
+
+	iter.pages = pages;
+	iter.addr = NULL;
+	iter.offset = start & (PAGE_SIZE - 1);
+
+	func(sock, len, &iter, &lens);
+	BUG_ON(lens.each == 0);
+
+	total = lens.nr * lens.each;
+
+	rds_info_iter_unmap(&iter);
+
+	if (total > len) {
+		len = total;
+		ret = -ENOSPC;
+	} else {
+		len = total;
+		ret = lens.each;
+	}
+
+	if (put_user(len, optlen))
+		ret = -EFAULT;
+
+out:
+	for (i = 0; pages != NULL && i < nr_pages; i++)
+		put_page(pages[i]);
+	kfree(pages);
+
+	return ret;
+}
diff --git a/drivers/infiniband/ulp/rds/info.h b/drivers/infiniband/ulp/rds/info.h
new file mode 100644
index 0000000..dd0c285
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/info.h
@@ -0,0 +1,43 @@
+#ifndef _RDS_INFO_H
+#define _RDS_INFO_H
+
+/* FIXME remove these */
+#define RDS_INFO_COUNTERS		10000
+#define RDS_INFO_CONNECTIONS		10001
+#define RDS_INFO_FLOWS			10002
+#define RDS_INFO_SEND_MESSAGES		10003
+#define RDS_INFO_RETRANS_MESSAGES	10004
+#define RDS_INFO_RECV_MESSAGES		10005
+#define RDS_INFO_SOCKETS		10006
+#define RDS_INFO_TCP_SOCKETS		10007
+
+#define RDS_INFO_FIRST		RDS_INFO_COUNTERS
+#define RDS_INFO_LAST		RDS_INFO_CONNECTION_STATS
+
+struct rds_info_lengths {
+	unsigned int	nr;
+	unsigned int	each;
+};
+
+struct rds_info_iterator;
+
+/*
+ * These functions must fill in the fields of @lens to reflect the size
+ * of the available info source.  If the snapshot fits in @len then it
+ * should be copied using @iter.  The caller will deduce if it was copied
+ * or not by comparing the lengths.
+ */
+typedef void (*rds_info_func)(struct socket *sock, unsigned int len,
+			      struct rds_info_iterator *iter,
+			      struct rds_info_lengths *lens);
+
+void rds_info_register_func(int optname, rds_info_func func);
+void rds_info_deregister_func(int optname, rds_info_func func);
+int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval,
+			int __user *optlen);
+void rds_info_copy(struct rds_info_iterator *iter, void *data,
+		   unsigned long bytes);
+void rds_info_iter_unmap(struct rds_info_iterator *iter);
+
+#endif
diff --git a/drivers/infiniband/ulp/rds/stats.c b/drivers/infiniband/ulp/rds/stats.c
new file mode 100644
index 0000000..74f1f96
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/stats.c
@@ -0,0 +1,150 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/percpu.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+#include "rds.h"
+
+DEFINE_PER_CPU(struct rds_statistics, rds_stats) ____cacheline_aligned;
+EXPORT_PER_CPU_SYMBOL_GPL(rds_stats);
+
+/* :.,$s/unsigned long\>.*\<s_\(.*\);/"\1",/g */
+
+static char *rds_stat_names[] = {
+	"conn_reset",
+	"recv_drop_bad_checksum",
+	"recv_drop_old_seq",
+	"recv_drop_no_sock",
+	"recv_drop_dead_sock",
+	"recv_deliver_raced",
+	"recv_delivered",
+	"recv_queued",
+	"recv_immediate_retry",
+	"recv_delayed_retry",
+	"recv_ack_required",
+	"recv_rdma_bytes",
+	"recv_ping",
+	"send_queue_empty",
+	"send_queue_full",
+	"send_sem_contention",
+	"send_sem_queue_raced",
+	"send_immediate_retry",
+	"send_delayed_retry",
+	"send_drop_acked",
+	"send_ack_required",
+	"send_queued",
+	"send_rdma",
+	"send_rdma_bytes",
+	"send_pong",
+	"page_remainder_hit",
+	"page_remainder_miss",
+	"copy_to_user",
+	"copy_from_user",
+	"cong_update_queued",
+	"cong_update_received",
+	"cong_send_error",
+	"cong_send_blocked",
+};
+
+void rds_stats_info_copy(struct rds_info_iterator *iter,
+			 uint64_t *values, char **names, size_t nr)
+{
+	struct rds_info_counter ctr;
+	size_t i;
+
+	for (i = 0; i < nr; i++) {
+		BUG_ON(strlen(names[i]) >= sizeof(ctr.name));
+		strncpy(ctr.name, names[i], sizeof(ctr.name) - 1);
+		ctr.name[sizeof(ctr.name) - 1] = '\0';
+		ctr.value = values[i];
+
+		rds_info_copy(iter, &ctr, sizeof(ctr));
+	}
+}
+EXPORT_SYMBOL_GPL(rds_stats_info_copy);
+
+/*
+ * This gives global counters across all the transports.  The strings
+ * are copied in so that the tool doesn't need knowledge of the specific
+ * stats that we're exporting.  Some are pretty implementation dependent
+ * and may change over time.  That doesn't stop them from being useful.
+ *
+ * This is the only function in the chain that knows about the byte granular
+ * length in userspace.  It converts it to number of stat entries that the
+ * rest of the functions operate in.
+ */
+static void rds_stats_info(struct socket *sock, unsigned int len,
+			   struct rds_info_iterator *iter,
+			   struct rds_info_lengths *lens)
+{
+	struct rds_statistics stats = {0, };
+	uint64_t *src;
+	uint64_t *sum;
+	size_t i;
+	int cpu;
+	unsigned int avail;
+
+	avail = len / sizeof(struct rds_info_counter);
+
+	if (avail < ARRAY_SIZE(rds_stat_names)) {
+		avail = 0;
+		goto trans;
+	}
+
+	for_each_online_cpu(cpu) {
+		src = (uint64_t *)&(per_cpu(rds_stats, cpu));
+		sum = (uint64_t *)&stats;
+		for (i = 0; i < sizeof(stats) / sizeof(uint64_t); i++)
+			*(sum++) += *(src++);
+	}
+
+	rds_stats_info_copy(iter, (uint64_t *)&stats, rds_stat_names,
+			    ARRAY_SIZE(rds_stat_names));
+	avail -= ARRAY_SIZE(rds_stat_names);
+
+trans:
+	lens->each = sizeof(struct rds_info_counter);
+	lens->nr = rds_trans_stats_info_copy(iter, avail) +
+		   ARRAY_SIZE(rds_stat_names);
+}
+
+void rds_stats_exit(void)
+{
+	rds_info_deregister_func(RDS_INFO_COUNTERS, rds_stats_info);
+}
+
+int __init rds_stats_init(void)
+{
+	rds_info_register_func(RDS_INFO_COUNTERS, rds_stats_info);
+	return 0;
+}
-- 
1.5.6.3


* [PATCH 06/21] RDS: Connection handling
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (4 preceding siblings ...)
  2009-01-27  2:17 ` [ofa-general] [PATCH 05/21] RDS: Info and stats Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27 13:34   ` Evgeniy Polyakov
  2009-01-27  2:17 ` [PATCH 07/21] RDS: loopback Andy Grover
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

Arguably, the fact that the underlying transport needs a
connection to convey RDS's datagrams reliably is of no concern
to rds proper. However, the transports implemented so far (IB and
TCP) have both been connection-oriented, so the connection state
machine-related code lives in the common rds code.

This patch also includes the work items that handle connecting,
sending, receiving, and shutdown.
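
The state machine pivots on rds_conn_transition() (defined in rds.h,
patch 02); presumably it amounts to an atomic compare-and-swap on
c_state, along these lines:

    static inline int
    rds_conn_transition(struct rds_connection *conn, int old, int new)
    {
            return atomic_cmpxchg(&conn->c_state, old, new) == old;
    }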

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/connection.c |  501 +++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/threads.c    |  273 +++++++++++++++++
 2 files changed, 774 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/connection.c
 create mode 100644 drivers/infiniband/ulp/rds/threads.c

diff --git a/drivers/infiniband/ulp/rds/connection.c b/drivers/infiniband/ulp/rds/connection.c
new file mode 100644
index 0000000..6174629
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/connection.c
@@ -0,0 +1,501 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <net/inet_hashtables.h>
+
+#include "rds.h"
+#include "loop.h"
+#include "rdma.h"
+
+#define RDS_CONNECTION_HASH_BITS 12
+#define RDS_CONNECTION_HASH_ENTRIES (1 << RDS_CONNECTION_HASH_BITS)
+#define RDS_CONNECTION_HASH_MASK (RDS_CONNECTION_HASH_ENTRIES - 1)
+
+/* converting this to RCU is a chore for another day.. */
+static DEFINE_SPINLOCK(rds_conn_lock);
+static unsigned long rds_conn_count;
+static struct hlist_head rds_conn_hash[RDS_CONNECTION_HASH_ENTRIES];
+static struct kmem_cache *rds_conn_slab;
+
+static struct hlist_head *rds_conn_bucket(__be32 laddr, __be32 faddr)
+{
+	/* Pass NULL, don't need struct net for hash */
+	unsigned long hash = inet_ehashfn(NULL,
+					  be32_to_cpu(laddr), 0,
+					  be32_to_cpu(faddr), 0);
+	return &rds_conn_hash[hash & RDS_CONNECTION_HASH_MASK];
+}
+
+#define rds_conn_info_set(var, test, suffix) do {		\
+	if (test)						\
+		var |= RDS_INFO_CONNECTION_FLAG_##suffix;	\
+} while (0)
+
+static inline int rds_conn_is_sending(struct rds_connection *conn)
+{
+	int ret = 0;
+
+	if (!mutex_trylock(&conn->c_send_lock))
+		ret = 1;
+	else
+		mutex_unlock(&conn->c_send_lock);
+
+	return ret;
+}
+
+static struct rds_connection *rds_conn_lookup(struct hlist_head *head,
+					      __be32 laddr, __be32 faddr,
+					      struct rds_transport *trans)
+{
+	struct rds_connection *conn, *ret = NULL;
+	struct hlist_node *pos;
+
+	hlist_for_each_entry(conn, pos, head, c_hash_node) {
+		if (conn->c_faddr == faddr && conn->c_laddr == laddr &&
+				conn->c_trans == trans) {
+			ret = conn;
+			break;
+		}
+	}
+	rdsdebug("returning conn %p for %u.%u.%u.%u -> %u.%u.%u.%u\n", ret,
+		 NIPQUAD(laddr), NIPQUAD(faddr));
+	return ret;
+}
+
+/*
+ * This is called by transports as they're bringing down a connection.
+ * It clears partial message state so that the transport can start sending
+ * and receiving over this connection again in the future.  It is up to
+ * the transport to have serialized this call with its send and recv.
+ */
+void rds_conn_reset(struct rds_connection *conn)
+{
+	rdstrace(RDS_CONNECTION, RDS_MINIMAL,
+	  "connection %u.%u.%u.%u to %u.%u.%u.%u reset\n",
+	  NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr));
+
+	rds_stats_inc(s_conn_reset);
+	rds_send_reset(conn);
+	conn->c_flags = 0;
+
+	/* Do not clear next_rx_seq here, else we cannot distinguish
+	 * retransmitted packets from new packets, and will hand all
+	 * of them to the application. That is not consistent with the
+	 * reliability guarantees of RDS. */
+}
+
+/*
+ * There is only ever one 'conn' for a given pair of addresses in the
+ * system at a time.  They contain messages to be retransmitted and so
+ * span the lifetime of the actual underlying transport connections.
+ *
+ * For now they are not garbage collected once they're created.  They
+ * are torn down as the module is removed, if ever.
+ */
+static struct rds_connection *__rds_conn_create(__be32 laddr, __be32 faddr,
+				       struct rds_transport *trans, gfp_t gfp,
+				       int is_outgoing)
+{
+	struct rds_connection *conn, *tmp, *parent = NULL;
+	struct hlist_head *head = rds_conn_bucket(laddr, faddr);
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&rds_conn_lock, flags);
+	conn = rds_conn_lookup(head, laddr, faddr, trans);
+	if (conn && conn->c_loopback &&
+	    conn->c_trans != &rds_loop_transport && !is_outgoing) {
+		/* This is a looped back IB connection, and we're
+		 * called by the code handling the incoming connect.
+		 * We need a second connection object into which we
+		 * can stick the other QP. */
+		parent = conn;
+		conn = parent->c_passive;
+	}
+	spin_unlock_irqrestore(&rds_conn_lock, flags);
+	if (conn)
+		goto out;
+
+	conn = kmem_cache_alloc(rds_conn_slab, gfp);
+	if (conn == NULL) {
+		conn = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	memset(conn, 0, sizeof(*conn));
+
+	INIT_HLIST_NODE(&conn->c_hash_node);
+	conn->c_version = RDS_PROTOCOL_3_0;
+	conn->c_laddr = laddr;
+	conn->c_faddr = faddr;
+	spin_lock_init(&conn->c_lock);
+	conn->c_next_tx_seq = 1;
+
+	mutex_init(&conn->c_send_lock);
+	INIT_LIST_HEAD(&conn->c_send_queue);
+	INIT_LIST_HEAD(&conn->c_retrans);
+
+	ret = rds_cong_get_maps(conn);
+	if (ret) {
+		kmem_cache_free(rds_conn_slab, conn);
+		conn = ERR_PTR(ret);
+		goto out;
+	}
+
+	/*
+	 * This is where a connection becomes loopback.  If *any* RDS sockets
+	 * can bind to the destination address then we'd rather the messages
+	 * flow through loopback rather than either transport.
+	 */
+	if (rds_trans_get_preferred(faddr)) {
+		conn->c_loopback = 1;
+		if (is_outgoing && trans->t_prefer_loopback) {
+			/* "outgoing" connection - and the transport
+			 * says it wants the connection handled by the
+			 * loopback transport. This is what TCP does.
+			 */
+			trans = &rds_loop_transport;
+		}
+	}
+
+	conn->c_trans = trans;
+
+	ret = trans->conn_alloc(conn, gfp);
+	if (ret) {
+		kmem_cache_free(rds_conn_slab, conn);
+		conn = ERR_PTR(ret);
+		goto out;
+	}
+
+	atomic_set(&conn->c_state, RDS_CONN_DOWN);
+	conn->c_reconnect_jiffies = 0;
+	INIT_DELAYED_WORK(&conn->c_send_w, rds_send_worker);
+	INIT_DELAYED_WORK(&conn->c_recv_w, rds_recv_worker);
+	INIT_DELAYED_WORK(&conn->c_conn_w, rds_connect_worker);
+	INIT_WORK(&conn->c_down_w, rds_shutdown_worker);
+	mutex_init(&conn->c_cm_lock);
+	conn->c_flags = 0;
+
+	rdsdebug("allocated conn %p for %u.%u.%u.%u -> %u.%u.%u.%u\n", conn,
+		 NIPQUAD(laddr), NIPQUAD(faddr));
+
+	rdstrace(RDS_CONNECTION, RDS_MINIMAL,
+	  "allocated conn %p for %u.%u.%u.%u -> %u.%u.%u.%u over %s %s\n",
+	  conn, NIPQUAD(laddr), NIPQUAD(faddr),
+	  trans->t_name ? trans->t_name : "[unknown]",
+	  is_outgoing ? "(outgoing)" : "");
+
+	spin_lock_irqsave(&rds_conn_lock, flags);
+	if (parent == NULL) {
+		tmp = rds_conn_lookup(head, laddr, faddr, trans);
+		if (tmp == NULL)
+			hlist_add_head(&conn->c_hash_node, head);
+	} else {
+		tmp = parent->c_passive;
+		if (!tmp)
+			parent->c_passive = conn;
+	}
+
+	if (tmp) {
+		trans->conn_free(conn->c_transport_data);
+		kmem_cache_free(rds_conn_slab, conn);
+		conn = tmp;
+	} else {
+		rds_cong_add_conn(conn);
+		rds_conn_count++;
+	}
+
+	spin_unlock_irqrestore(&rds_conn_lock, flags);
+
+out:
+	return conn;
+}
+
+struct rds_connection *rds_conn_create(__be32 laddr, __be32 faddr,
+				       struct rds_transport *trans, gfp_t gfp)
+{
+	return __rds_conn_create(laddr, faddr, trans, gfp, 0);
+}
+EXPORT_SYMBOL_GPL(rds_conn_create);
+
+struct rds_connection *rds_conn_create_outgoing(__be32 laddr, __be32 faddr,
+				       struct rds_transport *trans, gfp_t gfp)
+{
+	return __rds_conn_create(laddr, faddr, trans, gfp, 1);
+}
+EXPORT_SYMBOL_GPL(rds_conn_create_outgoing);
+
+void rds_conn_destroy(struct rds_connection *conn)
+{
+	struct rds_message *rm, *rtmp;
+
+	rdsdebug("freeing conn %p for %u.%u.%u.%u -> "
+		 "%u.%u.%u.%u\n", conn, NIPQUAD(conn->c_laddr),
+		 NIPQUAD(conn->c_faddr));
+
+	rdstrace(RDS_CONNECTION, RDS_MINIMAL,
+	  "freeing conn %p for %u.%u.%u.%u -> %u.%u.%u.%u\n",
+	  conn, NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr));
+
+	hlist_del_init(&conn->c_hash_node);
+
+	/* wait for the rds thread to shut it down */
+	atomic_set(&conn->c_state, RDS_CONN_ERROR);
+	cancel_delayed_work(&conn->c_conn_w);
+	queue_work(rds_wq, &conn->c_down_w);
+	flush_workqueue(rds_wq);
+
+	/* tear down queued messages */
+	list_for_each_entry_safe(rm, rtmp,
+				 &conn->c_send_queue,
+				 m_conn_item) {
+		list_del_init(&rm->m_conn_item);
+		BUG_ON(!list_empty(&rm->m_sock_item));
+		rds_message_put(rm);
+	}
+	if (conn->c_xmit_rm)
+		rds_message_put(conn->c_xmit_rm);
+
+	conn->c_trans->conn_free(conn->c_transport_data);
+
+	/*
+	 * The congestion maps aren't freed up here.  They're
+	 * freed by rds_cong_exit() after all the connections
+	 * have been freed.
+	 */
+	rds_cong_remove_conn(conn);
+
+	BUG_ON(!list_empty(&conn->c_retrans));
+	kmem_cache_free(rds_conn_slab, conn);
+
+	rds_conn_count--;
+}
+EXPORT_SYMBOL_GPL(rds_conn_destroy);
+
+static void rds_conn_message_info(struct socket *sock, unsigned int len,
+				  struct rds_info_iterator *iter,
+				  struct rds_info_lengths *lens,
+				  int want_send)
+{
+	struct hlist_head *head;
+	struct hlist_node *pos;
+	struct list_head *list;
+	struct rds_connection *conn;
+	struct rds_message *rm;
+	unsigned long flags;
+	unsigned int total = 0;
+	size_t i;
+
+	len /= sizeof(struct rds_info_message);
+
+	spin_lock_irqsave(&rds_conn_lock, flags);
+
+	for (i = 0, head = rds_conn_hash; i < ARRAY_SIZE(rds_conn_hash);
+	     i++, head++) {
+		hlist_for_each_entry(conn, pos, head, c_hash_node) {
+			if (want_send)
+				list = &conn->c_send_queue;
+			else
+				list = &conn->c_retrans;
+
+			spin_lock(&conn->c_lock);
+
+			/* XXX too lazy to maintain counts.. */
+			list_for_each_entry(rm, list, m_conn_item) {
+				total++;
+				if (total <= len)
+					rds_inc_info_copy(&rm->m_inc, iter,
+							  conn->c_laddr,
+							  conn->c_faddr, 0);
+			}
+
+			spin_unlock(&conn->c_lock);
+		}
+	}
+
+	spin_unlock_irqrestore(&rds_conn_lock, flags);
+
+	lens->nr = total;
+	lens->each = sizeof(struct rds_info_message);
+}
+
+static void rds_conn_message_info_send(struct socket *sock, unsigned int len,
+				       struct rds_info_iterator *iter,
+				       struct rds_info_lengths *lens)
+{
+	rds_conn_message_info(sock, len, iter, lens, 1);
+}
+
+static void rds_conn_message_info_retrans(struct socket *sock,
+					  unsigned int len,
+					  struct rds_info_iterator *iter,
+					  struct rds_info_lengths *lens)
+{
+	rds_conn_message_info(sock, len, iter, lens, 0);
+}
+
+void rds_for_each_conn_info(struct socket *sock, unsigned int len,
+			  struct rds_info_iterator *iter,
+			  struct rds_info_lengths *lens,
+			  int (*visitor)(struct rds_connection *, void *),
+			  size_t item_len)
+{
+	uint64_t buffer[(item_len + 7) / 8];
+	struct hlist_head *head;
+	struct hlist_node *pos;
+	struct hlist_node *tmp;
+	struct rds_connection *conn;
+	unsigned long flags;
+	size_t i;
+
+	spin_lock_irqsave(&rds_conn_lock, flags);
+
+	lens->nr = 0;
+	lens->each = item_len;
+
+	for (i = 0, head = rds_conn_hash; i < ARRAY_SIZE(rds_conn_hash);
+	     i++, head++) {
+		hlist_for_each_entry_safe(conn, pos, tmp, head, c_hash_node) {
+
+			/* XXX no c_lock usage.. */
+			if (!visitor(conn, buffer))
+				continue;
+
+			/* We copy as much as we can fit in the buffer,
+			 * but we count all items so that the caller
+			 * can resize the buffer. */
+			if (len >= item_len) {
+				rds_info_copy(iter, buffer, item_len);
+				len -= item_len;
+			}
+			lens->nr++;
+		}
+	}
+
+	spin_unlock_irqrestore(&rds_conn_lock, flags);
+}
+EXPORT_SYMBOL_GPL(rds_for_each_conn_info);
+
+static int rds_conn_info_visitor(struct rds_connection *conn,
+				  void *buffer)
+{
+	struct rds_info_connection *cinfo = buffer;
+
+	cinfo->next_tx_seq = conn->c_next_tx_seq;
+	cinfo->next_rx_seq = conn->c_next_rx_seq;
+	cinfo->laddr = conn->c_laddr;
+	cinfo->faddr = conn->c_faddr;
+	strncpy(cinfo->transport, conn->c_trans->t_name,
+		sizeof(cinfo->transport));
+	cinfo->flags = 0;
+
+	rds_conn_info_set(cinfo->flags,
+			  rds_conn_is_sending(conn), SENDING);
+	/* XXX Future: return the state rather than these funky bits */
+	rds_conn_info_set(cinfo->flags,
+			  atomic_read(&conn->c_state) == RDS_CONN_CONNECTING,
+			  CONNECTING);
+	rds_conn_info_set(cinfo->flags,
+			  atomic_read(&conn->c_state) == RDS_CONN_UP,
+			  CONNECTED);
+	return 1;
+}
+
+static void rds_conn_info(struct socket *sock, unsigned int len,
+			  struct rds_info_iterator *iter,
+			  struct rds_info_lengths *lens)
+{
+	rds_for_each_conn_info(sock, len, iter, lens,
+				rds_conn_info_visitor,
+				sizeof(struct rds_info_connection));
+}
+
+int __init rds_conn_init(void)
+{
+	rds_conn_slab = kmem_cache_create("rds_connection",
+					  sizeof(struct rds_connection),
+					  0, 0, NULL);
+	if (rds_conn_slab == NULL)
+		return -ENOMEM;
+
+	rds_info_register_func(RDS_INFO_CONNECTIONS, rds_conn_info);
+	rds_info_register_func(RDS_INFO_SEND_MESSAGES,
+			       rds_conn_message_info_send);
+	rds_info_register_func(RDS_INFO_RETRANS_MESSAGES,
+			       rds_conn_message_info_retrans);
+
+	return 0;
+}
+
+void rds_conn_exit(void)
+{
+	rds_loop_exit();
+
+	WARN_ON(!hlist_empty(rds_conn_hash));
+
+	kmem_cache_destroy(rds_conn_slab);
+
+	rds_info_deregister_func(RDS_INFO_CONNECTIONS, rds_conn_info);
+	rds_info_deregister_func(RDS_INFO_SEND_MESSAGES,
+				 rds_conn_message_info_send);
+	rds_info_deregister_func(RDS_INFO_RETRANS_MESSAGES,
+				 rds_conn_message_info_retrans);
+}
+
+/*
+ * Force a disconnect
+ */
+void rds_conn_drop(struct rds_connection *conn)
+{
+	atomic_set(&conn->c_state, RDS_CONN_ERROR);
+	queue_work(rds_wq, &conn->c_down_w);
+}
+EXPORT_SYMBOL_GPL(rds_conn_drop);
+
+/*
+ * An error occurred on the connection
+ */
+void
+__rds_conn_error(struct rds_connection *conn, const char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vprintk(fmt, ap);
+	va_end(ap);
+
+	rds_conn_drop(conn);
+}
diff --git a/drivers/infiniband/ulp/rds/threads.c b/drivers/infiniband/ulp/rds/threads.c
new file mode 100644
index 0000000..0017f90
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/threads.c
@@ -0,0 +1,273 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/random.h>
+
+#include "rds.h"
+
+/*
+ * All of connection management is simplified by serializing it through
+ * work queues that execute in a connection managing thread.
+ *
+ * TCP wants to send acks through sendpage() in response to data_ready(),
+ * but it needs a process context to do so.
+ *
+ * The receive paths need to allocate but can't drop packets (!) so we keep
+ * a thread around that can block on allocation when the receive fast path
+ * sees an allocation failure.
+ */
+
+/* Grand Unified Theory of connection life cycle:
+ * At any point in time, the connection can be in one of these states:
+ * DOWN, CONNECTING, UP, DISCONNECTING, ERROR
+ *
+ * The following transitions are possible:
+ *  ANY		  -> ERROR
+ *  UP		  -> DISCONNECTING
+ *  ERROR	  -> DISCONNECTING
+ *  DISCONNECTING -> DOWN
+ *  DOWN	  -> CONNECTING
+ *  CONNECTING	  -> UP
+ *
+ * Transition to state DISCONNECTING/DOWN:
+ *  -	Inside the shutdown worker; synchronizes with xmit path
+ *	through c_send_lock, and with connection management callbacks
+ *	via c_cm_lock.
+ *
+ *	For receive callbacks, we rely on the underlying transport
+ *	(TCP, IB/RDMA) to provide the necessary synchronisation.
+ */
+struct workqueue_struct *rds_wq;
+EXPORT_SYMBOL_GPL(rds_wq);
+
+void rds_connect_complete(struct rds_connection *conn)
+{
+	if (!rds_conn_transition(conn, RDS_CONN_CONNECTING, RDS_CONN_UP)) {
+		printk(KERN_WARNING "%s: Cannot transition to state UP, "
+				"current state is %d\n",
+				__func__,
+				atomic_read(&conn->c_state));
+		atomic_set(&conn->c_state, RDS_CONN_ERROR);
+		queue_work(rds_wq, &conn->c_down_w);
+		return;
+	}
+
+	rdstrace(RDS_CONNECTION, RDS_MINIMAL,
+	  "conn %p for %u.%u.%u.%u to %u.%u.%u.%u complete\n",
+	  conn, NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr));
+
+	conn->c_reconnect_jiffies = 0;
+	set_bit(0, &conn->c_map_queued);
+	queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+	queue_delayed_work(rds_wq, &conn->c_recv_w, 0);
+}
+EXPORT_SYMBOL_GPL(rds_connect_complete);
+
+/*
+ * This random exponential backoff is relied on to eventually resolve racing
+ * connects.
+ *
+ * If connect attempts race then both parties drop both connections and come
+ * here to wait for a random amount of time before trying again.  Eventually
+ * the backoff range will be so much greater than the time it takes to
+ * establish a connection that one of the pair will establish the connection
+ * before the other's random delay fires.
+ *
+ * Connection attempts that arrive while a connection is already established
+ * are also considered to be racing connects.  This lets a connection from
+ * a rebooted machine replace an existing stale connection before the transport
+ * notices that the connection has failed.
+ *
+ * We should *always* start with a random backoff; otherwise a broken connection
+ * will always take several iterations to be re-established.
+ */
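+/*
+ * For illustration (bounds made up; the real ones come from the reconnect
+ * sysctls): with min = 100 and max = 1000 jiffies, the first attempt fires
+ * immediately and subsequent delays are drawn randomly below ceilings of
+ * 100, 200, 400, 800, 1000, 1000, ...
+ */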
+static void rds_queue_reconnect(struct rds_connection *conn)
+{
+	unsigned long rand;
+
+	rdstrace(RDS_CONNECTION, RDS_LOW,
+	  "conn %p for %u.%u.%u.%u to %u.%u.%u.%u reconnect jiffies %lu\n",
+	  conn, NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr),
+	  conn->c_reconnect_jiffies);
+
+	set_bit(RDS_RECONNECT_PENDING, &conn->c_flags);
+	if (conn->c_reconnect_jiffies == 0) {
+		conn->c_reconnect_jiffies = rds_sysctl_reconnect_min_jiffies;
+		queue_delayed_work(rds_wq, &conn->c_conn_w, 0);
+		return;
+	}
+
+	get_random_bytes(&rand, sizeof(rand));
+	rdsdebug("%lu delay %lu ceil conn %p for %u.%u.%u.%u -> %u.%u.%u.%u\n",
+		 rand % conn->c_reconnect_jiffies, conn->c_reconnect_jiffies,
+		 conn, NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr));
+	queue_delayed_work(rds_wq, &conn->c_conn_w,
+			   rand % conn->c_reconnect_jiffies);
+
+	conn->c_reconnect_jiffies = min(conn->c_reconnect_jiffies * 2,
+					rds_sysctl_reconnect_max_jiffies);
+}
+
+void rds_connect_worker(struct work_struct *work)
+{
+	struct rds_connection *conn = container_of(work, struct rds_connection, c_conn_w.work);
+	int ret;
+
+	clear_bit(RDS_RECONNECT_PENDING, &conn->c_flags);
+	if (rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) {
+		ret = conn->c_trans->conn_connect(conn);
+		rdsdebug("connect conn %p for %u.%u.%u.%u -> %u.%u.%u.%u "
+			 "ret %d\n", conn, NIPQUAD(conn->c_laddr),
+			 NIPQUAD(conn->c_faddr), ret);
+		rdstrace(RDS_CONNECTION, RDS_MINIMAL,
+		  "conn %p for %u.%u.%u.%u to %u.%u.%u.%u dispatched, ret %d\n",
+		  conn, NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr), ret);
+
+		if (ret) {
+			if (rds_conn_transition(conn, RDS_CONN_CONNECTING, RDS_CONN_DOWN))
+				rds_queue_reconnect(conn);
+			else
+				rds_conn_error(conn, "RDS: connect failed\n");
+		}
+	}
+}
+
+void rds_shutdown_worker(struct work_struct *work)
+{
+	struct rds_connection *conn = container_of(work, struct rds_connection, c_down_w);
+
+	/* shut it down unless it's down already */
+	if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_DOWN)) {
+		/*
+		 * Quiesce the connection mgmt handlers before we start tearing
+		 * things down. We don't hold the mutex for the entire
+		 * duration of the shutdown operation, else we may be
+		 * deadlocking with the CM handler. Instead, the CM event
+		 * handler is supposed to check for state DISCONNECTING
+		 */
+		mutex_lock(&conn->c_cm_lock);
+		if (!rds_conn_transition(conn, RDS_CONN_UP, RDS_CONN_DISCONNECTING) &&
+		    !rds_conn_transition(conn, RDS_CONN_ERROR, RDS_CONN_DISCONNECTING)) {
+			rds_conn_error(conn, "shutdown called in state %d\n",
+					atomic_read(&conn->c_state));
+			mutex_unlock(&conn->c_cm_lock);
+			return;
+		}
+		mutex_unlock(&conn->c_cm_lock);
+
+		mutex_lock(&conn->c_send_lock);
+		conn->c_trans->conn_shutdown(conn);
+		rds_conn_reset(conn);
+		mutex_unlock(&conn->c_send_lock);
+
+		if (!rds_conn_transition(conn, RDS_CONN_DISCONNECTING, RDS_CONN_DOWN)) {
+			/* This can happen - eg when we're in the middle of tearing
+			 * down the connection, and someone unloads the rds module.
+			 * Quite reproducible with loopback connections.
+			 * Mostly harmless.
+			 */
+			rds_conn_error(conn,
+				"%s: failed to transition to state DOWN, "
+				"current state is %d\n",
+				__func__,
+				atomic_read(&conn->c_state));
+			return;
+		}
+	}
+
+	/* Then reconnect if it's still live.
+	 * The passive side of an IB loopback connection is never added
+	 * to the conn hash, so we never trigger a reconnect on this
+	 * conn - the reconnect is always triggered by the active peer. */
+	cancel_delayed_work(&conn->c_conn_w);
+	if (!hlist_unhashed(&conn->c_hash_node))
+		rds_queue_reconnect(conn);
+}
+
+void rds_send_worker(struct work_struct *work)
+{
+	struct rds_connection *conn = container_of(work, struct rds_connection, c_send_w.work);
+	int ret;
+
+	if (rds_conn_state(conn) == RDS_CONN_UP) {
+		ret = rds_send_xmit(conn);
+		rdsdebug("conn %p ret %d\n", conn, ret);
+		switch (ret) {
+		case -EAGAIN:
+			rds_stats_inc(s_send_immediate_retry);
+			queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+			break;
+		case -ENOMEM:
+			rds_stats_inc(s_send_delayed_retry);
+			queue_delayed_work(rds_wq, &conn->c_send_w, 2);
+			break;
+		default:
+			break;
+		}
+	}
+}
+
+void rds_recv_worker(struct work_struct *work)
+{
+	struct rds_connection *conn = container_of(work, struct rds_connection, c_recv_w.work);
+	int ret;
+
+	if (rds_conn_state(conn) == RDS_CONN_UP) {
+		ret = conn->c_trans->recv(conn);
+		rdsdebug("conn %p ret %d\n", conn, ret);
+		switch (ret) {
+		case -EAGAIN:
+			rds_stats_inc(s_recv_immediate_retry);
+			queue_delayed_work(rds_wq, &conn->c_recv_w, 0);
+			break;
+		case -ENOMEM:
+			rds_stats_inc(s_recv_delayed_retry);
+			queue_delayed_work(rds_wq, &conn->c_recv_w, 2);
+			break;
+		default:
+			break;
+		}
+	}
+}
+
+void rds_threads_exit(void)
+{
+	destroy_workqueue(rds_wq);
+}
+
+int __init rds_threads_init(void)
+{
+	rds_wq = create_singlethread_workqueue("krdsd");
+	if (rds_wq == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
-- 
1.5.6.3


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 07/21] RDS: loopback
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (5 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 06/21] RDS: Connection handling Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [PATCH 08/21] RDS: sysctls Andy Grover
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

A simple rds transport to handle loopback connections.
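
The delivery path is direct; a sketch of the call flow (function names
are from this patch and from send.c later in the series):

	rds_sendmsg()
	  -> rds_send_xmit()
	       -> rds_loop_xmit()
	            -> rds_recv_incoming()	/* same-machine delivery */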

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/loop.c |  189 +++++++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/loop.h |    9 ++
 2 files changed, 198 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/loop.c
 create mode 100644 drivers/infiniband/ulp/rds/loop.h

diff --git a/drivers/infiniband/ulp/rds/loop.c b/drivers/infiniband/ulp/rds/loop.c
new file mode 100644
index 0000000..40fa729
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/loop.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/in.h>
+
+#include "rds.h"
+#include "loop.h"
+
+static DEFINE_SPINLOCK(loop_conns_lock);
+static LIST_HEAD(loop_conns);
+
+/*
+ * This 'loopback' transport is a special case for flows that originate
+ * and terminate on the same machine.
+ *
+ * Connection build-up notices if the destination address is thought of
+ * as a local address by a transport.  At that time it decides to use the
+ * loopback transport instead of the bound transport of the sending socket.
+ *
+ * The loopback transport's sending path just hands the sent rds_message
+ * straight to the receiving path via an embedded rds_incoming.
+ */
+
+/*
+ * Usually a message transits both the sender and receiver's conns as it
+ * flows to the receiver.  In the loopback case, though, the receive path
+ * is handed the sending conn so the sense of the addresses is reversed.
+ */
+static int rds_loop_xmit(struct rds_connection *conn, struct rds_message *rm,
+			 unsigned int hdr_off, unsigned int sg,
+			 unsigned int off)
+{
+	BUG_ON(hdr_off || sg || off);
+
+	rds_inc_init(&rm->m_inc, conn, conn->c_laddr);
+	rds_message_addref(rm); /* for the inc */
+
+	rds_recv_incoming(conn, conn->c_laddr, conn->c_faddr, &rm->m_inc,
+			  GFP_KERNEL, KM_USER0);
+
+	rds_send_drop_acked(conn, be64_to_cpu(rm->m_inc.i_hdr.h_sequence),
+			    NULL);
+
+	rds_inc_put(&rm->m_inc);
+
+	return sizeof(struct rds_header) + be32_to_cpu(rm->m_inc.i_hdr.h_len);
+}
+
+static int rds_loop_xmit_cong_map(struct rds_connection *conn,
+				  struct rds_cong_map *map,
+				  unsigned long offset)
+{
+	unsigned long i;
+
+	BUG_ON(offset);
+	BUG_ON(map != conn->c_lcong);
+
+	for (i = 0; i < RDS_CONG_MAP_PAGES; i++) {
+		memcpy((void *)conn->c_fcong->m_page_addrs[i],
+		       (void *)map->m_page_addrs[i], PAGE_SIZE);
+	}
+
+	rds_cong_map_updated(conn->c_fcong, ~(u64) 0);
+
+	return sizeof(struct rds_header) + RDS_CONG_MAP_BYTES;
+}
+
+/* the recv worker still calls .recv, so give it something that trivially succeeds */
+static int rds_loop_recv(struct rds_connection *conn)
+{
+	return 0;
+}
+
+struct rds_loop_connection {
+	struct list_head loop_node;
+	struct rds_connection *conn;
+};
+
+/*
+ * Even the loopback transport needs to keep track of its connections,
+ * so it can call rds_conn_destroy() on them on exit. N.B. there are
+ * multiple loopback addresses (127.*.*.*), so it's not a bug to have
+ * several loopback conns allocated, though they're rather useless.
+ */
+static int rds_loop_conn_alloc(struct rds_connection *conn, gfp_t gfp)
+{
+	struct rds_loop_connection *lc;
+	unsigned long flags;
+
+	lc = kzalloc(sizeof(struct rds_loop_connection), GFP_KERNEL);
+	if (lc == NULL)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&lc->loop_node);
+	lc->conn = conn;
+	conn->c_transport_data = lc;
+
+	spin_lock_irqsave(&loop_conns_lock, flags);
+	list_add_tail(&lc->loop_node, &loop_conns);
+	spin_unlock_irqrestore(&loop_conns_lock, flags);
+
+	return 0;
+}
+
+static void rds_loop_conn_free(void *arg)
+{
+	struct rds_loop_connection *lc = arg;
+	rdsdebug("lc %p\n", lc);
+	list_del(&lc->loop_node);
+	kfree(lc);
+}
+
+static int rds_loop_conn_connect(struct rds_connection *conn)
+{
+	rds_connect_complete(conn);
+	return 0;
+}
+
+static void rds_loop_conn_shutdown(struct rds_connection *conn)
+{
+}
+
+void rds_loop_exit(void)
+{
+	struct rds_loop_connection *lc, *_lc;
+	LIST_HEAD(tmp_list);
+
+	/* avoid calling conn_destroy with irqs off */
+	spin_lock_irq(&loop_conns_lock);
+	list_splice(&loop_conns, &tmp_list);
+	INIT_LIST_HEAD(&loop_conns);
+	spin_unlock_irq(&loop_conns_lock);
+
+	list_for_each_entry_safe(lc, _lc, &tmp_list, loop_node) {
+		WARN_ON(lc->conn->c_passive);
+		rds_conn_destroy(lc->conn);
+	}
+}
+
+/*
+ * This is missing .xmit_* because loop doesn't go through generic
+ * rds_send_xmit() and doesn't call rds_recv_incoming().  .listen_stop and
+ * .laddr_check are missing because transport.c doesn't iterate over
+ * rds_loop_transport.
+ */
+struct rds_transport rds_loop_transport = {
+	.xmit			= rds_loop_xmit,
+	.xmit_cong_map		= rds_loop_xmit_cong_map,
+	.recv			= rds_loop_recv,
+	.conn_alloc		= rds_loop_conn_alloc,
+	.conn_free		= rds_loop_conn_free,
+	.conn_connect		= rds_loop_conn_connect,
+	.conn_shutdown		= rds_loop_conn_shutdown,
+	.inc_copy_to_user	= rds_message_inc_copy_to_user,
+	.inc_purge		= rds_message_inc_purge,
+	.inc_free		= rds_message_inc_free,
+	.t_name			= "loopback",
+};
diff --git a/drivers/infiniband/ulp/rds/loop.h b/drivers/infiniband/ulp/rds/loop.h
new file mode 100644
index 0000000..f32b093
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/loop.h
@@ -0,0 +1,9 @@
+#ifndef _RDS_LOOP_H
+#define _RDS_LOOP_H
+
+/* loop.c */
+extern struct rds_transport rds_loop_transport;
+
+void rds_loop_exit(void);
+
+#endif
-- 
1.5.6.3


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 08/21] RDS: sysctls
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (6 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 07/21] RDS: loopback Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [PATCH 09/21] RDS: Message parsing Andy Grover
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

In addition to some tunable parameters, we also make the protocol and
socket-level numbers available here, since we do not have fixed numbers
assigned until the code is accepted upstream. This can be removed once
the code is upstream.
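
Since the numbers can move, applications are expected to read them at
runtime. A minimal userspace sketch (an illustration only, not part of
this patch; it assumes the /proc/sys/net/rds/ files registered below):

	#include <stdio.h>
	#include <sys/socket.h>

	static int read_sysctl_int(const char *path)
	{
		FILE *f = fopen(path, "r");
		int val = -1;

		if (f) {
			if (fscanf(f, "%d", &val) != 1)
				val = -1;
			fclose(f);
		}
		return val;
	}

	/* ... */
	int pf = read_sysctl_int("/proc/sys/net/rds/pf_rds");
	int fd = socket(pf, SOCK_SEQPACKET, 0);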

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/sysctl.c |  164 +++++++++++++++++++++++++++++++++++
 1 files changed, 164 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/sysctl.c

diff --git a/drivers/infiniband/ulp/rds/sysctl.c b/drivers/infiniband/ulp/rds/sysctl.c
new file mode 100644
index 0000000..3337a3e
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/sysctl.c
@@ -0,0 +1,164 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/sysctl.h>
+#include <linux/proc_fs.h>
+
+#include "rds.h"
+
+static struct ctl_table_header *rds_sysctl_reg_table;
+
+static unsigned long rds_sysctl_reconnect_min = 1;
+static unsigned long rds_sysctl_reconnect_max = ~0UL;
+
+unsigned long rds_sysctl_reconnect_min_jiffies;
+unsigned long rds_sysctl_reconnect_max_jiffies = HZ;
+
+unsigned int  rds_sysctl_max_unacked_packets = 8;
+unsigned int  rds_sysctl_max_unacked_bytes = (16 << 20);
+
+unsigned int rds_sysctl_ping_enable = 1;
+
+unsigned long rds_sysctl_trace_flags;
+unsigned int  rds_sysctl_trace_level;
+
+/*
+ * These can change over time until they're official.  Until that time we'll
+ * give apps a way to figure out what the values are on a given machine.
+ */
+static int rds_sysctl_pf_rds = PF_RDS;
+static int rds_sysctl_sol_rds = SOL_RDS;
+
+static struct ctl_table rds_sysctl_rds_table[] = {
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "reconnect_min_delay_ms",
+		.data		= &rds_sysctl_reconnect_min_jiffies,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_ms_jiffies_minmax,
+		.extra1		= &rds_sysctl_reconnect_min,
+		.extra2		= &rds_sysctl_reconnect_max_jiffies,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "reconnect_max_delay_ms",
+		.data		= &rds_sysctl_reconnect_max_jiffies,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_ms_jiffies_minmax,
+		.extra1		= &rds_sysctl_reconnect_min_jiffies,
+		.extra2		= &rds_sysctl_reconnect_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "pf_rds",
+		.data		= &rds_sysctl_pf_rds,
+		.maxlen         = sizeof(int),
+		.mode           = 0444,
+		.proc_handler   = &proc_dointvec,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "sol_rds",
+		.data		= &rds_sysctl_sol_rds,
+		.maxlen         = sizeof(int),
+		.mode           = 0444,
+		.proc_handler   = &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_unacked_packets",
+		.data		= &rds_sysctl_max_unacked_packets,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_unacked_bytes",
+		.data		= &rds_sysctl_max_unacked_bytes,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "ping_enable",
+		.data		= &rds_sysctl_ping_enable,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "trace_flags",
+		.data		= &rds_sysctl_trace_flags,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "trace_level",
+		.data		= &rds_sysctl_trace_level,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{ .ctl_name = 0 }
+};
+
+static struct ctl_path rds_sysctl_path[] = {
+	{ .procname = "net", .ctl_name = CTL_NET, },
+	{ .procname = "rds", .ctl_name = CTL_UNNUMBERED, },
+	{ }
+};
+
+
+void rds_sysctl_exit(void)
+{
+	if (rds_sysctl_reg_table)
+		unregister_sysctl_table(rds_sysctl_reg_table);
+}
+
+int __init rds_sysctl_init(void)
+{
+	rds_sysctl_reconnect_min = msecs_to_jiffies(1);
+	rds_sysctl_reconnect_min_jiffies = rds_sysctl_reconnect_min;
+
+	rds_sysctl_reg_table = register_sysctl_paths(rds_sysctl_path, rds_sysctl_rds_table);
+	if (rds_sysctl_reg_table == NULL)
+		return -ENOMEM;
+	return 0;
+}
-- 
1.5.6.3


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 09/21] RDS: Message parsing
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (7 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 08/21] RDS: sysctls Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [PATCH 10/21] RDS: send.c Andy Grover
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

Parsing of newly received RDS message headers (including extension
headers) and copy-to/from-user routines.

page.c implements a per-cpu page remainder cache, to reduce the
number of allocations needed for small datagrams.
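
As an illustration of the remainder cache (a sketch only; it assumes 4KB
pages and both calls landing on the same CPU):

	struct scatterlist sg;

	/* grabs a fresh page, hands out the first 1KB, caches the rest */
	rds_page_remainder_alloc(&sg, 1024, GFP_KERNEL);

	/* served from the cached page at offset 1KB, taking a page ref */
	rds_page_remainder_alloc(&sg, 2048, GFP_KERNEL);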

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/message.c |  414 ++++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/page.c    |  222 ++++++++++++++++++
 2 files changed, 636 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/message.c
 create mode 100644 drivers/infiniband/ulp/rds/page.c

diff --git a/drivers/infiniband/ulp/rds/message.c b/drivers/infiniband/ulp/rds/message.c
new file mode 100644
index 0000000..5cad4d5
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/message.c
@@ -0,0 +1,414 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+
+#include "rds.h"
+#include "rdma.h"
+
+static DECLARE_WAIT_QUEUE_HEAD(rds_message_flush_waitq);
+
+static unsigned int	rds_exthdr_size[__RDS_EXTHDR_MAX] = {
+[RDS_EXTHDR_NONE]	= 0,
+[RDS_EXTHDR_VERSION]	= sizeof(struct rds_ext_header_version),
+[RDS_EXTHDR_RDMA]	= sizeof(struct rds_ext_header_rdma),
+[RDS_EXTHDR_RDMA_DEST]	= sizeof(struct rds_ext_header_rdma_dest),
+};
+
+
+void rds_message_addref(struct rds_message *rm)
+{
+	rdsdebug("addref rm %p ref %d\n", rm, atomic_read(&rm->m_refcount));
+	atomic_inc(&rm->m_refcount);
+}
+EXPORT_SYMBOL_GPL(rds_message_addref);
+
+/*
+ * This relies on dma_map_sg() not touching sg[].page during merging.
+ */
+static void rds_message_purge(struct rds_message *rm)
+{
+	unsigned long i;
+
+	if (unlikely(test_bit(RDS_MSG_PAGEVEC, &rm->m_flags)))
+		return;
+
+	for (i = 0; i < rm->m_nents; i++) {
+		rdsdebug("putting data page %p\n", (void *)sg_page(&rm->m_sg[i]));
+		/* XXX will have to put_page for page refs */
+		__free_page(sg_page(&rm->m_sg[i]));
+	}
+	rm->m_nents = 0;
+
+	if (rm->m_rdma_op)
+		rds_rdma_free_op(rm->m_rdma_op);
+	if (rm->m_rdma_mr)
+		rds_mr_put(rm->m_rdma_mr);
+}
+
+void rds_message_inc_purge(struct rds_incoming *inc)
+{
+	struct rds_message *rm = container_of(inc, struct rds_message, m_inc);
+	rds_message_purge(rm);
+}
+
+void rds_message_put(struct rds_message *rm)
+{
+	rdsdebug("put rm %p ref %d\n", rm, atomic_read(&rm->m_refcount));
+
+	if (atomic_dec_and_test(&rm->m_refcount)) {
+		BUG_ON(!list_empty(&rm->m_sock_item));
+		BUG_ON(!list_empty(&rm->m_conn_item));
+		rds_message_purge(rm);
+
+		kfree(rm);
+	}
+}
+EXPORT_SYMBOL_GPL(rds_message_put);
+
+void rds_message_inc_free(struct rds_incoming *inc)
+{
+	struct rds_message *rm = container_of(inc, struct rds_message, m_inc);
+	rds_message_put(rm);
+}
+
+void rds_message_populate_header(struct rds_header *hdr, __be16 sport,
+				 __be16 dport, u64 seq)
+{
+	hdr->h_flags = 0;
+	hdr->h_sport = sport;
+	hdr->h_dport = dport;
+	hdr->h_sequence = cpu_to_be64(seq);
+	hdr->h_exthdr[0] = RDS_EXTHDR_NONE;
+}
+EXPORT_SYMBOL_GPL(rds_message_populate_header);
+
+int rds_message_add_extension(struct rds_header *hdr,
+		unsigned int type, const void *data, unsigned int len)
+{
+	unsigned int ext_len = sizeof(u8) + len;
+	unsigned char *dst;
+
+	/* For now, refuse to add more than one extension header */
+	if (hdr->h_exthdr[0] != RDS_EXTHDR_NONE)
+		return 0;
+
+	if (type >= __RDS_EXTHDR_MAX
+	 || len != rds_exthdr_size[type])
+		return 0;
+
+	if (ext_len >= RDS_HEADER_EXT_SPACE)
+		return 0;
+	dst = hdr->h_exthdr;
+
+	*dst++ = type;
+	memcpy(dst, data, len);
+
+	dst[len] = RDS_EXTHDR_NONE;
+	return 1;
+}
+EXPORT_SYMBOL_GPL(rds_message_add_extension);
+
+/*
+ * If a message has extension headers, retrieve them here.
+ * Call like this:
+ *
+ * unsigned int pos = 0;
+ *
+ * while (1) {
+ *	buflen = sizeof(buffer);
+ *	type = rds_message_next_extension(hdr, &pos, buffer, &buflen);
+ *	if (type == RDS_EXTHDR_NONE)
+ *		break;
+ *	...
+ * }
+ */
+int rds_message_next_extension(struct rds_header *hdr,
+		unsigned int *pos, void *buf, unsigned int *buflen)
+{
+	unsigned int offset, ext_type, ext_len;
+	u8 *src = hdr->h_exthdr;
+
+	offset = *pos;
+	if (offset >= RDS_HEADER_EXT_SPACE)
+		goto none;
+
+	/* Get the extension type and length. For now, the
+	 * length is implied by the extension type. */
+	ext_type = src[offset++];
+
+	if (ext_type == RDS_EXTHDR_NONE || ext_type >= __RDS_EXTHDR_MAX)
+		goto none;
+	ext_len = rds_exthdr_size[ext_type];
+	if (offset + ext_len > RDS_HEADER_EXT_SPACE)
+		goto none;
+
+	*pos = offset + ext_len;
+	if (ext_len < *buflen)
+		*buflen = ext_len;
+	memcpy(buf, src + offset, *buflen);
+	return ext_type;
+
+none:
+	*pos = RDS_HEADER_EXT_SPACE;
+	*buflen = 0;
+	return RDS_EXTHDR_NONE;
+}
+
+int rds_message_add_version_extension(struct rds_header *hdr, unsigned int version)
+{
+	struct rds_ext_header_version ext_hdr;
+
+	ext_hdr.h_version = cpu_to_be32(version);
+	return rds_message_add_extension(hdr, RDS_EXTHDR_VERSION, &ext_hdr, sizeof(ext_hdr));
+}
+
+int rds_message_get_version_extension(struct rds_header *hdr, unsigned int *version)
+{
+	struct rds_ext_header_version ext_hdr;
+	unsigned int pos = 0, len = sizeof(ext_hdr);
+
+	/* We assume the version extension is the only one present */
+	if (rds_message_next_extension(hdr, &pos, &ext_hdr, &len) != RDS_EXTHDR_VERSION)
+		return 0;
+	*version = be32_to_cpu(ext_hdr.h_version);
+	return 1;
+}
+
+int rds_message_add_rdma_dest_extension(struct rds_header *hdr, u32 r_key, u32 offset)
+{
+	struct rds_ext_header_rdma_dest ext_hdr;
+
+	ext_hdr.h_rdma_rkey = cpu_to_be32(r_key);
+	ext_hdr.h_rdma_offset = cpu_to_be32(offset);
+	return rds_message_add_extension(hdr, RDS_EXTHDR_RDMA_DEST, &ext_hdr, sizeof(ext_hdr));
+}
+EXPORT_SYMBOL_GPL(rds_message_add_rdma_dest_extension);
+
+struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp)
+{
+	struct rds_message *rm;
+
+	rm = kzalloc(sizeof(struct rds_message) +
+		     (nents * sizeof(struct scatterlist)), gfp);
+	if (!rm)
+		goto out;
+
+#ifdef CONFIG_DEBUG_SG
+	{
+		unsigned int i;
+
+		for (i = 0; i < nents; i++)
+			rm->m_sg[i].sg_magic = SG_MAGIC;
+	}
+#endif
+	atomic_set(&rm->m_refcount, 1);
+	INIT_LIST_HEAD(&rm->m_sock_item);
+	INIT_LIST_HEAD(&rm->m_conn_item);
+	spin_lock_init(&rm->m_rs_lock);
+
+out:
+	return rm;
+}
+
+struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len)
+{
+	struct rds_message *rm;
+	unsigned int i;
+
+	rm = rds_message_alloc(ceil(total_len, PAGE_SIZE), GFP_KERNEL);
+	if (rm == NULL)
+		return ERR_PTR(-ENOMEM);
+
+	set_bit(RDS_MSG_PAGEVEC, &rm->m_flags);
+	rm->m_inc.i_hdr.h_len = cpu_to_be32(total_len);
+	rm->m_nents = ceil(total_len, PAGE_SIZE);
+
+	for (i = 0; i < rm->m_nents; ++i) {
+		sg_set_page(&rm->m_sg[i],
+				virt_to_page(page_addrs[i]),
+				PAGE_SIZE, 0);
+	}
+
+	return rm;
+}
+
+struct rds_message *rds_message_copy_from_user(struct iovec *first_iov,
+					       size_t total_len)
+{
+	unsigned long to_copy;
+	unsigned long iov_off;
+	unsigned long sg_off;
+	struct rds_message *rm;
+	struct iovec *iov;
+	struct scatterlist *sg;
+	int ret;
+
+	rm = rds_message_alloc(ceil(total_len, PAGE_SIZE), GFP_KERNEL);
+	if (rm == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	rm->m_inc.i_hdr.h_len = cpu_to_be32(total_len);
+
+	/*
+	 * now allocate and copy in the data payload.
+	 */
+	sg = rm->m_sg;
+	iov = first_iov;
+	iov_off = 0;
+	sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */
+
+	while (total_len) {
+		if (sg_page(sg) == NULL) {
+			ret = rds_page_remainder_alloc(sg, total_len,
+						       GFP_HIGHUSER);
+			if (ret)
+				goto out;
+			rm->m_nents++;
+			sg_off = 0;
+		}
+
+		while (iov_off == iov->iov_len) {
+			iov_off = 0;
+			iov++;
+		}
+
+		to_copy = min(iov->iov_len - iov_off, sg->length - sg_off);
+		to_copy = min_t(size_t, to_copy, total_len);
+
+		rdsdebug("copying %lu bytes from user iov [%p, %zu] + %lu to "
+			 "sg [%p, %u, %u] + %lu\n",
+			 to_copy, iov->iov_base, iov->iov_len, iov_off,
+			 (void *)sg_page(sg), sg->offset, sg->length, sg_off);
+
+		ret = rds_page_copy_from_user(sg_page(sg), sg->offset + sg_off,
+					      iov->iov_base + iov_off,
+					      to_copy);
+		if (ret)
+			goto out;
+
+		iov_off += to_copy;
+		total_len -= to_copy;
+		sg_off += to_copy;
+
+		if (sg_off == sg->length)
+			sg++;
+	}
+
+	ret = 0;
+out:
+	if (ret) {
+		if (rm)
+			rds_message_put(rm);
+		rm = ERR_PTR(ret);
+	}
+	return rm;
+}
+
+int rds_message_inc_copy_to_user(struct rds_incoming *inc,
+				 struct iovec *first_iov, size_t size)
+{
+	struct rds_message *rm;
+	struct iovec *iov;
+	struct scatterlist *sg;
+	unsigned long to_copy;
+	unsigned long iov_off;
+	unsigned long vec_off;
+	int copied;
+	int ret;
+	u32 len;
+
+	rm = container_of(inc, struct rds_message, m_inc);
+	len = be32_to_cpu(rm->m_inc.i_hdr.h_len);
+
+	iov = first_iov;
+	iov_off = 0;
+	sg = rm->m_sg;
+	vec_off = 0;
+	copied = 0;
+
+	while (copied < size && copied < len) {
+		while (iov_off == iov->iov_len) {
+			iov_off = 0;
+			iov++;
+		}
+
+		to_copy = min(iov->iov_len - iov_off, sg->length - vec_off);
+		to_copy = min_t(size_t, to_copy, size - copied);
+		to_copy = min_t(unsigned long, to_copy, len - copied);
+
+		rdsdebug("copying %lu bytes to user iov [%p, %zu] + %lu to "
+			 "sg [%p, %u, %u] + %lu\n",
+			 to_copy, iov->iov_base, iov->iov_len, iov_off,
+			 sg_page(sg), sg->offset, sg->length, vec_off);
+
+		ret = rds_page_copy_to_user(sg_page(sg), sg->offset + vec_off,
+					    iov->iov_base + iov_off,
+					    to_copy);
+		if (ret) {
+			copied = ret;
+			break;
+		}
+
+		iov_off += to_copy;
+		vec_off += to_copy;
+		copied += to_copy;
+
+		if (vec_off == sg->length) {
+			vec_off = 0;
+			sg++;
+		}
+	}
+
+	return copied;
+}
+
+/*
+ * If the message is still on the send queue, wait until the transport
+ * is done with it. This is particularly important for RDMA operations.
+ */
+void rds_message_wait(struct rds_message *rm)
+{
+	wait_event(rds_message_flush_waitq,
+			!test_bit(RDS_MSG_MAPPED, &rm->m_flags));
+}
+
+void rds_message_unmapped(struct rds_message *rm)
+{
+	clear_bit(RDS_MSG_MAPPED, &rm->m_flags);
+	if (waitqueue_active(&rds_message_flush_waitq))
+		wake_up(&rds_message_flush_waitq);
+}
+EXPORT_SYMBOL_GPL(rds_message_unmapped);
+
diff --git a/drivers/infiniband/ulp/rds/page.c b/drivers/infiniband/ulp/rds/page.c
new file mode 100644
index 0000000..55c21ef
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/page.c
@@ -0,0 +1,222 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/highmem.h>
+
+#include "rds.h"
+
+struct rds_page_remainder {
+	struct page	*r_page;
+	unsigned long	r_offset;
+};
+
+DEFINE_PER_CPU(struct rds_page_remainder, rds_page_remainders) ____cacheline_aligned;
+
+/*
+ * returns 0 on success or -errno on failure.
+ *
+ * We don't have to worry about flush_dcache_page() as this only works
+ * with private pages.  If, say, we were to do directed receive to pinned
+ * user pages we'd have to worry more about cache coherence.  (Though
+ * the flush_dcache_page() in get_user_pages() would probably be enough).
+ */
+int rds_page_copy_user(struct page *page, unsigned long offset,
+		       void __user *ptr, unsigned long bytes,
+		       int to_user)
+{
+	unsigned long ret;
+	void *addr;
+
+	if (to_user)
+		rds_stats_add(s_copy_to_user, bytes);
+	else
+		rds_stats_add(s_copy_from_user, bytes);
+
+	addr = kmap_atomic(page, KM_USER0);
+	if (to_user)
+		ret = __copy_to_user_inatomic(ptr, addr + offset, bytes);
+	else
+		ret = __copy_from_user_inatomic(addr + offset, ptr, bytes);
+	kunmap_atomic(addr, KM_USER0);
+
+	if (ret) {
+		addr = kmap(page);
+		if (to_user)
+			ret = copy_to_user(ptr, addr + offset, bytes);
+		else
+			ret = copy_from_user(addr + offset, ptr, bytes);
+		kunmap(page);
+		if (ret)
+			return -EFAULT;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(rds_page_copy_user);
+
+/*
+ * Message allocation uses this to build up regions of a message.
+ *
+ * @bytes - the number of bytes needed.
+ * @gfp - the waiting behaviour of the allocation
+ *
+ * @gfp is always ORed with __GFP_HIGHMEM.  Callers must be prepared to
+ * kmap the pages, etc.
+ *
+ * If @bytes is at least a full page then this just returns a page from
+ * alloc_page().
+ *
+ * If @bytes is a partial page then this stores the unused region of the
+ * page in a per-cpu structure.  Future partial-page allocations may be
+ * satisfied from that cached region.  This lets us waste less memory on
+ * small allocations with minimal complexity.  It works because the transmit
+ * path passes read-only page regions down to devices.  They hold a page
+ * reference until they are done with the region.
+ */
+int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes,
+			     gfp_t gfp)
+{
+	struct rds_page_remainder *rem;
+	unsigned long flags;
+	struct page *page;
+	int ret;
+
+	gfp |= __GFP_HIGHMEM;
+
+	/* jump straight to allocation if we're trying for a whole page or more */
+	if (bytes >= PAGE_SIZE) {
+		page = alloc_page(gfp);
+		if (page == NULL) {
+			ret = -ENOMEM;
+		} else {
+			sg_set_page(scat, page, PAGE_SIZE, 0);
+			ret = 0;
+		}
+		goto out;
+	}
+
+	rem = &per_cpu(rds_page_remainders, get_cpu());
+	local_irq_save(flags);
+
+	while (1) {
+		/* avoid a tiny region getting stuck by tossing it */
+		if (rem->r_page && bytes > (PAGE_SIZE - rem->r_offset)) {
+			rds_stats_inc(s_page_remainder_miss);
+			__free_page(rem->r_page);
+			rem->r_page = NULL;
+		}
+
+		/* hand out a fragment from the cached page */
+		if (rem->r_page && bytes <= (PAGE_SIZE - rem->r_offset)) {
+			sg_set_page(scat, rem->r_page, bytes, rem->r_offset);
+			get_page(sg_page(scat));
+
+			if (rem->r_offset != 0)
+				rds_stats_inc(s_page_remainder_hit);
+
+			rem->r_offset += bytes;
+			if (rem->r_offset == PAGE_SIZE) {
+				__free_page(rem->r_page);
+				rem->r_page = NULL;
+			}
+			ret = 0;
+			break;
+		}
+
+		/* alloc if there is nothing for us to use */
+		local_irq_restore(flags);
+		put_cpu();
+
+		page = alloc_page(gfp);
+
+		rem = &per_cpu(rds_page_remainders, get_cpu());
+		local_irq_save(flags);
+
+		if (page == NULL) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		/* did someone race to fill the remainder before us? */
+		if (rem->r_page) {
+			__free_page(page);
+			continue;
+		}
+
+		/* otherwise install our page and loop around to alloc */
+		rem->r_page = page;
+		rem->r_offset = 0;
+	}
+
+	local_irq_restore(flags);
+	put_cpu();
+out:
+	rdsdebug("bytes %lu ret %d %p %u %u\n", bytes, ret,
+		 ret ? NULL : sg_page(scat), ret ? 0 : scat->offset,
+		 ret ? 0 : scat->length);
+	return ret;
+}
+
+static int rds_page_remainder_cpu_notify(struct notifier_block *self,
+					 unsigned long action, void *hcpu)
+{
+	struct rds_page_remainder *rem;
+	long cpu = (long)hcpu;
+
+	rem = &per_cpu(rds_page_remainders, cpu);
+
+	rdsdebug("cpu %ld action 0x%lx\n", cpu, action);
+
+	switch (action) {
+	case CPU_DEAD:
+		if (rem->r_page)
+			__free_page(rem->r_page);
+		rem->r_page = NULL;
+		break;
+	}
+
+	return 0;
+}
+
+static struct notifier_block rds_page_remainder_nb = {
+	.notifier_call = rds_page_remainder_cpu_notify,
+};
+
+void rds_page_exit(void)
+{
+	int i;
+
+	for_each_possible_cpu(i)
+		rds_page_remainder_cpu_notify(&rds_page_remainder_nb,
+					      (unsigned long)CPU_DEAD,
+					      (void *)(long)i);
+}
-- 
1.5.6.3


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 10/21] RDS: send.c
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (8 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 09/21] RDS: Message parsing Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [PATCH 11/21] RDS: recv.c Andy Grover
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

This is the code to send an RDS datagram.
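
For reference, the userspace view of this path is an ordinary datagram
send (a hedged sketch, not part of this patch; the port number is
hypothetical):

	struct sockaddr_in dest = {
		.sin_family = AF_INET,
		.sin_port = htons(4000),
	};

	inet_aton("192.168.0.2", &dest.sin_addr);
	sendto(fd, buf, len, 0, (struct sockaddr *) &dest, sizeof(dest));

rds_sendmsg() below queues the message on the socket and connection,
and rds_send_xmit() pushes it down to the transport.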

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/send.c | 1006 +++++++++++++++++++++++++++++++++++++
 1 files changed, 1006 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/send.c

diff --git a/drivers/infiniband/ulp/rds/send.c b/drivers/infiniband/ulp/rds/send.c
new file mode 100644
index 0000000..276f7ac
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/send.c
@@ -0,0 +1,1006 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <net/sock.h>
+#include <linux/in.h>
+#include <linux/list.h>
+
+#include "rds.h"
+#include "rdma.h"
+
+/* When transmitting messages in rds_send_xmit, we need to emerge from
+ * time to time and briefly release the CPU. Otherwise the soft lockup
+ * watchdog will kick our shin.
+ * Also, it seems fairer to not let one busy connection stall all the
+ * others.
+ *
+ * send_batch_count is the number of times we'll loop in send_xmit. Setting
+ * it to 0 will restore the old behavior (where we looped until we had
+ * drained the queue).
+ */
+static int send_batch_count = 64;
+module_param(send_batch_count, int, 0444);
+MODULE_PARM_DESC(send_batch_count, "batch factor when working the send queue");
+
+/*
+ * Reset the send state. Caller must hold c_send_lock when calling here.
+ */
+void rds_send_reset(struct rds_connection *conn)
+{
+	struct rds_message *rm, *tmp;
+	unsigned long flags;
+
+	if (conn->c_xmit_rm) {
+		/* Tell the user the RDMA op is no longer mapped by the
+		 * transport. This isn't entirely true (it's flushed out
+		 * independently) but as the connection is down, there's
+		 * no ongoing RDMA to/from that memory */
+		rds_message_unmapped(conn->c_xmit_rm);
+		rds_message_put(conn->c_xmit_rm);
+		conn->c_xmit_rm = NULL;
+	}
+	conn->c_xmit_sg = 0;
+	conn->c_xmit_hdr_off = 0;
+	conn->c_xmit_data_off = 0;
+	conn->c_xmit_rdma_sent = 0;
+
+	conn->c_map_queued = 0;
+
+	conn->c_unacked_packets = rds_sysctl_max_unacked_packets;
+	conn->c_unacked_bytes = rds_sysctl_max_unacked_bytes;
+
+	/* Mark messages as retransmissions, and move them to the send q */
+	spin_lock_irqsave(&conn->c_lock, flags);
+	list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) {
+		set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+		set_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags);
+	}
+	list_splice_init(&conn->c_retrans, &conn->c_send_queue);
+	spin_unlock_irqrestore(&conn->c_lock, flags);
+}
+
+/*
+ * We're making the conscious trade-off here to only send one message
+ * down the connection at a time.
+ *   Pro:
+ *      - tx queueing is a simple fifo list
+ *      - reassembly is optional and easily done by transports per conn
+ *      - no per flow rx lookup at all, straight to the socket
+ *      - less per-frag memory and wire overhead
+ *   Con:
+ *      - queued acks can be delayed behind large messages
+ *   Depends:
+ *      - small message latency is higher behind queued large messages
+ *      - large message latency isn't starved by intervening small sends
+ */
+int rds_send_xmit(struct rds_connection *conn)
+{
+	struct rds_message *rm;
+	unsigned long flags;
+	unsigned int tmp;
+	unsigned int send_quota = send_batch_count;
+	struct scatterlist *sg;
+	int ret = 0;
+	int was_empty = 0;
+	LIST_HEAD(to_be_dropped);
+
+	/*
+	 * sendmsg calls here after having queued its message on the send
+	 * queue.  We only have one task feeding the connection at a time.  If
+	 * another thread is already feeding the queue then we back off.  This
+	 * avoids blocking the caller and trading per-connection data between
+	 * caches per message.
+	 *
+	 * The sem holder will issue a retry if they notice that someone queued
+	 * a message after they stopped walking the send queue but before they
+	 * dropped the sem.
+	 */
+	if (!mutex_trylock(&conn->c_send_lock)) {
+		rds_stats_inc(s_send_sem_contention);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (conn->c_trans->xmit_prepare)
+		conn->c_trans->xmit_prepare(conn);
+
+	/*
+	 * spin trying to push headers and data down the connection until
+	 * the connection stops making forward progress.
+	 */
+	while (--send_quota) {
+		/*
+		 * See if need to send a congestion map update if we're
+		 * between sending messages.  The send_sem protects our sole
+		 * use of c_map_offset and _bytes.
+		 * Note this is used only by transports that define a special
+		 * xmit_cong_map function. For all others, we allocate a
+		 * cong_map message and treat it just like any other send.
+		 */
+		if (conn->c_map_bytes) {
+			ret = conn->c_trans->xmit_cong_map(conn, conn->c_lcong,
+						conn->c_map_offset);
+			if (ret <= 0)
+				break;
+
+			conn->c_map_offset += ret;
+			conn->c_map_bytes -= ret;
+			if (conn->c_map_bytes)
+				continue;
+		}
+
+		/* If we're done sending the current message, clear the
+		 * offset and S/G temporaries.
+		 */
+		rm = conn->c_xmit_rm;
+		if (rm != NULL &&
+		    conn->c_xmit_hdr_off == sizeof(struct rds_header) &&
+		    conn->c_xmit_sg == rm->m_nents) {
+			conn->c_xmit_rm = NULL;
+			conn->c_xmit_sg = 0;
+			conn->c_xmit_hdr_off = 0;
+			conn->c_xmit_data_off = 0;
+			conn->c_xmit_rdma_sent = 0;
+
+			/* Release the reference to the previous message. */
+			rds_message_put(rm);
+			rm = NULL;
+		}
+
+		/* If we're asked to send a cong map update, do so.
+		 */
+		if (rm == NULL && test_and_clear_bit(0, &conn->c_map_queued)) {
+			if (conn->c_trans->xmit_cong_map != NULL) {
+				conn->c_map_offset = 0;
+				conn->c_map_bytes = sizeof(struct rds_header) +
+					RDS_CONG_MAP_BYTES;
+				continue;
+			}
+
+			rm = rds_cong_update_alloc(conn);
+			if (IS_ERR(rm)) {
+				ret = PTR_ERR(rm);
+				break;
+			}
+
+			conn->c_xmit_rm = rm;
+		}
+
+		/*
+		 * Grab the next message from the send queue, if there is one.
+		 *
+		 * c_xmit_rm holds a ref while we're sending this message down
+		 * the connection.  We can use this ref while holding the
+		 * send_sem.. rds_send_reset() is serialized with it.
+		 */
+		if (rm == NULL) {
+			unsigned int len;
+
+			spin_lock_irqsave(&conn->c_lock, flags);
+
+			if (!list_empty(&conn->c_send_queue)) {
+				rm = list_entry(conn->c_send_queue.next,
+						struct rds_message,
+						m_conn_item);
+				rds_message_addref(rm);
+
+				/*
+				 * Move the message from the send queue to the retransmit
+				 * list right away.
+				 */
+				list_move_tail(&rm->m_conn_item, &conn->c_retrans);
+			}
+
+			spin_unlock_irqrestore(&conn->c_lock, flags);
+
+			if (rm == NULL) {
+				was_empty = 1;
+				break;
+			}
+
+			/* Unfortunately, the way Infiniband deals with
+			 * RDMA to a bad MR key is by moving the entire
+			 * queue pair to error state. We could possibly
+			 * recover from that, but right now we drop the
+			 * connection.
+			 * Therefore, we never retransmit messages with RDMA ops.
+			 */
+			if (rm->m_rdma_op
+			 && test_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags)) {
+				spin_lock_irqsave(&conn->c_lock, flags);
+				if (test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags))
+					list_move(&rm->m_conn_item, &to_be_dropped);
+				spin_unlock_irqrestore(&conn->c_lock, flags);
+				rds_message_put(rm);
+				continue;
+			}
+
+			/* Require an ACK every once in a while */
+			len = ntohl(rm->m_inc.i_hdr.h_len);
+			if (conn->c_unacked_packets == 0
+			 || conn->c_unacked_bytes < len) {
+				__set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+
+				conn->c_unacked_packets = rds_sysctl_max_unacked_packets;
+				conn->c_unacked_bytes = rds_sysctl_max_unacked_bytes;
+				rds_stats_inc(s_send_ack_required);
+			} else {
+				conn->c_unacked_bytes -= len;
+				conn->c_unacked_packets--;
+			}
+
+			conn->c_xmit_rm = rm;
+		}
+
+		/*
+		 * Try and send an rdma message.  Let's see if we can
+		 * keep this simple and require that the transport either
+		 * send the whole rdma or none of it.
+		 */
+		if (rm->m_rdma_op && !conn->c_xmit_rdma_sent) {
+			ret = conn->c_trans->xmit_rdma(conn, rm->m_rdma_op);
+			if (ret)
+				break;
+			conn->c_xmit_rdma_sent = 1;
+			/* The transport owns the mapped memory for now.
+			 * You can't unmap it while it's on the send queue */
+			set_bit(RDS_MSG_MAPPED, &rm->m_flags);
+		}
+
+		if (conn->c_xmit_hdr_off < sizeof(struct rds_header) ||
+		    conn->c_xmit_sg < rm->m_nents) {
+			ret = conn->c_trans->xmit(conn, rm,
+						  conn->c_xmit_hdr_off,
+						  conn->c_xmit_sg,
+						  conn->c_xmit_data_off);
+			if (ret <= 0)
+				break;
+
+			if (conn->c_xmit_hdr_off < sizeof(struct rds_header)) {
+				tmp = min_t(int, ret,
+					    sizeof(struct rds_header) -
+					    conn->c_xmit_hdr_off);
+				conn->c_xmit_hdr_off += tmp;
+				ret -= tmp;
+			}
+
+			sg = &rm->m_sg[conn->c_xmit_sg];
+			while (ret) {
+				tmp = min_t(int, ret, sg->length -
+						      conn->c_xmit_data_off);
+				conn->c_xmit_data_off += tmp;
+				ret -= tmp;
+				if (conn->c_xmit_data_off == sg->length) {
+					conn->c_xmit_data_off = 0;
+					sg++;
+					conn->c_xmit_sg++;
+					BUG_ON(ret != 0 &&
+					       conn->c_xmit_sg == rm->m_nents);
+				}
+			}
+		}
+	}
+
+	/* Nuke any messages we decided not to retransmit. */
+	if (!list_empty(&to_be_dropped))
+		rds_send_remove_from_sock(&to_be_dropped, RDS_RDMA_DROPPED);
+
+	if (conn->c_trans->xmit_complete)
+		conn->c_trans->xmit_complete(conn);
+
+	/*
+	 * We might be racing with another sender who queued a message but
+	 * backed off on noticing that we held the c_send_lock.  If we check
+	 * for queued messages after dropping the sem then either we'll
+	 * see the queued message or the queuer will get the sem.  If we
+	 * notice the queued message then we trigger an immediate retry.
+	 *
+	 * We need to be careful only to do this when we stopped processing
+	 * the send queue because it was empty.  It's the only way we
+	 * stop processing the loop when the transport hasn't taken
+	 * responsibility for forward progress.
+	 */
+	mutex_unlock(&conn->c_send_lock);
+
+	if (conn->c_map_bytes || (send_quota == 0 && !was_empty)) {
+		/* We exhausted the send quota, but there's work left to
+		 * do. Return and (re-)schedule the send worker.
+		 */
+		ret = -EAGAIN;
+	}
+
+	if (ret == 0 && was_empty) {
+		/* A simple bit test would be way faster than taking the
+		 * spin lock */
+		spin_lock_irqsave(&conn->c_lock, flags);
+		if (!list_empty(&conn->c_send_queue)) {
+			rds_stats_inc(s_send_sem_queue_raced);
+			ret = -EAGAIN;
+		}
+		spin_unlock_irqrestore(&conn->c_lock, flags);
+	}
+out:
+	return ret;
+}
+
+static void rds_send_sndbuf_remove(struct rds_sock *rs, struct rds_message *rm)
+{
+	u32 len = be32_to_cpu(rm->m_inc.i_hdr.h_len);
+
+	assert_spin_locked(&rs->rs_lock);
+
+	BUG_ON(rs->rs_snd_bytes < len);
+	rs->rs_snd_bytes -= len;
+
+	if (rs->rs_snd_bytes == 0)
+		rds_stats_inc(s_send_queue_empty);
+}
+
+static inline int rds_send_is_acked(struct rds_message *rm, u64 ack,
+				    is_acked_func is_acked)
+{
+	if (is_acked)
+		return is_acked(rm, ack);
+	return be64_to_cpu(rm->m_inc.i_hdr.h_sequence) <= ack;
+}
+
+/*
+ * Returns true if there are no messages on the send and retransmit queues
+ * with a sequence number less than the given one, i.e. everything sent
+ * before that sequence number has been acked.  Both queues are ordered by
+ * sequence number, so checking only the first entry of each suffices.
+ */
+int rds_send_acked_before(struct rds_connection *conn, u64 seq)
+{
+	struct rds_message *rm, *tmp;
+	int ret = 1;
+
+	spin_lock(&conn->c_lock);
+
+	list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) {
+		if (be64_to_cpu(rm->m_inc.i_hdr.h_sequence) < seq)
+			ret = 0;
+		break;
+	}
+
+	list_for_each_entry_safe(rm, tmp, &conn->c_send_queue, m_conn_item) {
+		if (be64_to_cpu(rm->m_inc.i_hdr.h_sequence) < seq)
+			ret = 0;
+		break;
+	}
+
+	spin_unlock(&conn->c_lock);
+
+	return ret;
+}
+
+/*
+ * This is pretty similar to what happens below in the ACK
+ * handling code - except that we call here as soon as we get
+ * the IB send completion on the RDMA op and the accompanying
+ * message.
+ */
+void rds_rdma_send_complete(struct rds_message *rm, int status)
+{
+	struct rds_sock *rs = NULL;
+	struct rds_rdma_op *ro;
+	struct rds_notifier *notifier;
+
+	spin_lock(&rm->m_rs_lock);
+
+	ro = rm->m_rdma_op;
+	if (test_bit(RDS_MSG_ON_SOCK, &rm->m_flags)
+	 && ro && ro->r_notify
+	 && (notifier = ro->r_notifier) != NULL) {
+		rs = rm->m_rs;
+		sock_hold(rds_rs_to_sk(rs));
+
+		notifier->n_status = status;
+		spin_lock(&rs->rs_lock);
+		list_add_tail(&notifier->n_list, &rs->rs_notify_queue);
+		spin_unlock(&rs->rs_lock);
+
+		ro->r_notifier = NULL;
+	}
+
+	spin_unlock(&rm->m_rs_lock);
+
+	if (rs) {
+		rds_wake_sk_sleep(rs);
+		sock_put(rds_rs_to_sk(rs));
+	}
+}
+EXPORT_SYMBOL_GPL(rds_rdma_send_complete);
+
+/*
+ * This is the same as rds_rdma_send_complete except we
+ * don't do any locking - we have all the ingredients (message,
+ * socket, socket lock) and can just move the notifier.
+ */
+static inline void
+__rds_rdma_send_complete(struct rds_sock *rs, struct rds_message *rm, int status)
+{
+	struct rds_rdma_op *ro;
+
+	ro = rm->m_rdma_op;
+	if (ro && ro->r_notify && ro->r_notifier) {
+		ro->r_notifier->n_status = status;
+		list_add_tail(&ro->r_notifier->n_list, &rs->rs_notify_queue);
+		ro->r_notifier = NULL;
+	}
+
+	/* No need to wake the app - caller does this */
+}
+
+/*
+ * This is called from the IB send completion when we detect
+ * a RDMA operation that failed with remote access error.
+ * So speed is not an issue here.
+ */
+struct rds_message *rds_send_get_message(struct rds_connection *conn,
+					 struct rds_rdma_op *op)
+{
+	struct rds_message *rm, *tmp, *found = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&conn->c_lock, flags);
+
+	list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) {
+		if (rm->m_rdma_op == op) {
+			atomic_inc(&rm->m_refcount);
+			found = rm;
+			goto out;
+		}
+	}
+
+	list_for_each_entry_safe(rm, tmp, &conn->c_send_queue, m_conn_item) {
+		if (rm->m_rdma_op == op) {
+			atomic_inc(&rm->m_refcount);
+			found = rm;
+			break;
+		}
+	}
+
+out:
+	spin_unlock_irqrestore(&conn->c_lock, flags);
+
+	return found;
+}
+EXPORT_SYMBOL_GPL(rds_send_get_message);
+
+/*
+ * This removes messages from the socket's list if they're on it.  The list
+ * argument must be private to the caller, we must be able to modify it
+ * without locks.  The messages must have a reference held for their
+ * position on the list.  This function will drop that reference after
+ * removing the messages from the 'messages' list regardless of if it found
+ * the messages on the socket list or not.
+ */
+void rds_send_remove_from_sock(struct list_head *messages, int status)
+{
+	unsigned long flags = 0; /* silence gcc :P */
+	struct rds_sock *rs = NULL;
+	struct rds_message *rm;
+
+	local_irq_save(flags);
+	while (!list_empty(messages)) {
+		rm = list_entry(messages->next, struct rds_message,
+				m_conn_item);
+		list_del_init(&rm->m_conn_item);
+
+		/*
+		 * If we see this flag cleared then we're *sure* that someone
+		 * else beat us to removing it from the sock.  If we race
+		 * with their flag update we'll get the lock and then really
+		 * see that the flag has been cleared.
+		 *
+		 * The message spinlock makes sure nobody clears rm->m_rs
+		 * while we're messing with it. It does not prevent the
+		 * message from being removed from the socket, though.
+		 */
+		spin_lock(&rm->m_rs_lock);
+		if (!test_bit(RDS_MSG_ON_SOCK, &rm->m_flags))
+			goto unlock_and_drop;
+
+		if (rs != rm->m_rs) {
+			if (rs) {
+				spin_unlock(&rs->rs_lock);
+				rds_wake_sk_sleep(rs);
+				sock_put(rds_rs_to_sk(rs));
+			}
+			rs = rm->m_rs;
+			spin_lock(&rs->rs_lock);
+			sock_hold(rds_rs_to_sk(rs));
+		}
+
+		if (test_and_clear_bit(RDS_MSG_ON_SOCK, &rm->m_flags)) {
+			struct rds_rdma_op *ro = rm->m_rdma_op;
+			struct rds_notifier *notifier;
+
+			list_del_init(&rm->m_sock_item);
+			rds_send_sndbuf_remove(rs, rm);
+
+			if (ro &&
+			    (notifier = ro->r_notifier) != NULL &&
+			    (status || ro->r_notify)) {
+				list_add_tail(&notifier->n_list,
+						&rs->rs_notify_queue);
+				if (!notifier->n_status)
+					notifier->n_status = status;
+				rm->m_rdma_op->r_notifier = NULL;
+			}
+			rds_message_put(rm);
+			rm->m_rs = NULL;
+		}
+
+unlock_and_drop:
+		spin_unlock(&rm->m_rs_lock);
+		rds_message_put(rm);
+	}
+
+	if (rs) {
+		spin_unlock(&rs->rs_lock);
+		rds_wake_sk_sleep(rs);
+		sock_put(rds_rs_to_sk(rs));
+	}
+	local_irq_restore(flags);
+}
+
+/*
+ * Transports call here when they've determined that the receiver queued
+ * messages up to, and including, the given sequence number.  Messages are
+ * moved to the retrans queue when rds_send_xmit picks them off the send
+ * queue. This means that in the TCP case, the message may not have been
+ * assigned the m_ack_seq yet - but that's fine as long as tcp_is_acked
+ * checks the RDS_MSG_HAS_ACK_SEQ bit.
+ *
+ * XXX It's not clear to me how this is safely serialized with socket
+ * destruction.  Maybe it should bail if it sees SOCK_DEAD.
+ */
+void rds_send_drop_acked(struct rds_connection *conn, u64 ack,
+			 is_acked_func is_acked)
+{
+	struct rds_message *rm, *tmp;
+	unsigned long flags;
+	LIST_HEAD(list);
+
+	spin_lock_irqsave(&conn->c_lock, flags);
+
+	list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) {
+		if (!rds_send_is_acked(rm, ack, is_acked))
+			break;
+
+		list_move(&rm->m_conn_item, &list);
+		clear_bit(RDS_MSG_ON_CONN, &rm->m_flags);
+	}
+
+	/* order flag updates with spin locks */
+	if (!list_empty(&list))
+		smp_mb__after_clear_bit();
+
+	spin_unlock_irqrestore(&conn->c_lock, flags);
+
+	/* now remove the messages from the sock list as needed */
+	rds_send_remove_from_sock(&list, RDS_RDMA_SUCCESS);
+}
+EXPORT_SYMBOL_GPL(rds_send_drop_acked);
+
+void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
+{
+	struct rds_message *rm, *tmp;
+	struct rds_connection *conn;
+	unsigned long flags;
+	LIST_HEAD(list);
+	int wake = 0;
+
+	/* get all the messages we're dropping under the rs lock */
+	spin_lock_irqsave(&rs->rs_lock, flags);
+
+	list_for_each_entry_safe(rm, tmp, &rs->rs_send_queue, m_sock_item) {
+		if (dest && (dest->sin_addr.s_addr != rm->m_daddr ||
+			     dest->sin_port != rm->m_inc.i_hdr.h_dport))
+			continue;
+
+		wake = 1;
+		list_move(&rm->m_sock_item, &list);
+		rds_send_sndbuf_remove(rs, rm);
+		clear_bit(RDS_MSG_ON_SOCK, &rm->m_flags);
+
+		/* If this is a RDMA operation, notify the app. */
+		__rds_rdma_send_complete(rs, rm, RDS_RDMA_CANCELED);
+	}
+
+	/* order flag updates with the rs lock */
+	if (wake)
+		smp_mb__after_clear_bit();
+
+	spin_unlock_irqrestore(&rs->rs_lock, flags);
+
+	if (wake)
+		rds_wake_sk_sleep(rs);
+
+	conn = NULL;
+
+	/* now remove the messages from the conn list as needed */
+	list_for_each_entry(rm, &list, m_sock_item) {
+		/* We do this here rather than in the loop above, so that
+		 * we don't have to nest m_rs_lock under rs->rs_lock */
+		spin_lock(&rm->m_rs_lock);
+		rm->m_rs = NULL;
+		spin_unlock(&rm->m_rs_lock);
+
+		/*
+		 * If we see this flag cleared then we're *sure* that someone
+		 * else beat us to removing it from the conn.  If we race
+		 * with their flag update we'll get the lock and then really
+		 * see that the flag has been cleared.
+		 */
+		if (!test_bit(RDS_MSG_ON_CONN, &rm->m_flags))
+			continue;
+
+		if (conn != rm->m_inc.i_conn) {
+			if (conn)
+				spin_unlock_irqrestore(&conn->c_lock, flags);
+			conn = rm->m_inc.i_conn;
+			spin_lock_irqsave(&conn->c_lock, flags);
+		}
+
+		if (test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags)) {
+			list_del_init(&rm->m_conn_item);
+			rds_message_put(rm);
+		}
+	}
+
+	if (conn)
+		spin_unlock_irqrestore(&conn->c_lock, flags);
+
+	while (!list_empty(&list)) {
+		rm = list_entry(list.next, struct rds_message, m_sock_item);
+		list_del_init(&rm->m_sock_item);
+
+		rds_message_wait(rm);
+		rds_message_put(rm);
+	}
+}
+
+/*
+ * we only want this to fire once so we use the caller's 'queued'.  It's
+ * possible that another thread can race with us and remove the
+ * message from the flow with RDS_CANCEL_SENT_TO.
+ */
+static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
+			     struct rds_message *rm, __be16 sport,
+			     __be16 dport, int *queued)
+{
+	unsigned long flags;
+	u32 len;
+
+	if (*queued)
+		goto out;
+
+	len = be32_to_cpu(rm->m_inc.i_hdr.h_len);
+
+	/* this is the only place which holds both the socket's rs_lock
+	 * and the connection's c_lock */
+	spin_lock_irqsave(&rs->rs_lock, flags);
+
+	/*
+	 * If there is a little space in sndbuf, we don't queue anything,
+	 * and userspace gets -EAGAIN. But poll() indicates there's send
+	 * room. This can lead to bad behavior (spinning) if snd_bytes isn't
+	 * freed up by incoming acks. So we check the *old* value of
+	 * rs_snd_bytes here to allow the last msg to exceed the buffer,
+	 * and poll() now knows no more data can be sent.
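+	 *
+	 * For example, with a 64KB sndbuf and 63KB already queued, one
+	 * more 1MB message is still accepted; rs_snd_bytes then exceeds
+	 * the limit and poll() stops reporting send room until incoming
+	 * acks drain the queue.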
+	 */
+	if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) {
+		rs->rs_snd_bytes += len;
+
+		/* let recv side know we are close to send space exhaustion.
+		 * This is probably not the optimal way to do it, as this
+		 * means we set the flag on *all* messages as soon as the
+		 * queued bytes pass a certain threshold.
+		 */
+		if (rs->rs_snd_bytes >= rds_sk_sndbuf(rs) / 2)
+			__set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+
+		list_add_tail(&rm->m_sock_item, &rs->rs_send_queue);
+		set_bit(RDS_MSG_ON_SOCK, &rm->m_flags);
+		rds_message_addref(rm);
+		rm->m_rs = rs;
+
+		/* The code ordering is a little weird, but we're
+		 * trying to minimize the time we hold c_lock */
+		rds_message_populate_header(&rm->m_inc.i_hdr, sport, dport, 0);
+		rm->m_inc.i_conn = conn;
+		rds_message_addref(rm);
+
+		spin_lock(&conn->c_lock);
+		rm->m_inc.i_hdr.h_sequence = cpu_to_be64(conn->c_next_tx_seq++);
+		list_add_tail(&rm->m_conn_item, &conn->c_send_queue);
+		set_bit(RDS_MSG_ON_CONN, &rm->m_flags);
+		spin_unlock(&conn->c_lock);
+
+		rdsdebug("queued msg %p len %d, rs %p bytes %d seq %llu\n",
+			 rm, len, rs, rs->rs_snd_bytes,
+			 (unsigned long long)be64_to_cpu(rm->m_inc.i_hdr.h_sequence));
+
+		*queued = 1;
+	}
+
+	spin_unlock_irqrestore(&rs->rs_lock, flags);
+out:
+	return *queued;
+}
+
+static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm,
+			 struct msghdr *msg, int *allocated_mr)
+{
+	struct cmsghdr *cmsg;
+	int ret = 0;
+
+	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
+		if (!CMSG_OK(msg, cmsg))
+			return -EINVAL;
+
+		if (cmsg->cmsg_level != SOL_RDS)
+			continue;
+
+		/* As a side effect, RDMA_DEST and RDMA_MAP will set
+		 * rm->m_rdma_cookie and rm->m_rdma_mr.
+		 */
+		switch (cmsg->cmsg_type) {
+		case RDS_CMSG_RDMA_ARGS:
+			ret = rds_cmsg_rdma_args(rs, rm, cmsg);
+			break;
+
+		case RDS_CMSG_RDMA_DEST:
+			ret = rds_cmsg_rdma_dest(rs, rm, cmsg);
+			break;
+
+		case RDS_CMSG_RDMA_MAP:
+			ret = rds_cmsg_rdma_map(rs, rm, cmsg);
+			if (!ret)
+				*allocated_mr = 1;
+			break;
+
+		default:
+			return -EINVAL;
+		}
+
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+int rds_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		size_t payload_len)
+{
+	struct sock *sk = sock->sk;
+	struct rds_sock *rs = rds_sk_to_rs(sk);
+	struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name;
+	__be32 daddr;
+	__be16 dport;
+	struct rds_message *rm = NULL;
+	struct rds_connection *conn;
+	int ret = 0;
+	int queued = 0, allocated_mr = 0;
+	int nonblock = msg->msg_flags & MSG_DONTWAIT;
+	long timeo = sock_rcvtimeo(sk, nonblock);
+
+	/* Mirror Linux UDP's mirroring of BSD error message compatibility */
+	/* XXX: Perhaps MSG_MORE someday */
+	if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_CMSG_COMPAT)) {
+		printk(KERN_INFO "RDS: sendmsg: unsupported msg_flags 0x%08X\n",
+		       msg->msg_flags);
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	if (msg->msg_namelen) {
+		/* XXX fail non-unicast destination IPs? */
+		if (msg->msg_namelen < sizeof(*usin) || usin->sin_family != AF_INET) {
+			ret = -EINVAL;
+			goto out;
+		}
+		daddr = usin->sin_addr.s_addr;
+		dport = usin->sin_port;
+	} else {
+		/* We only care about consistency with ->connect() */
+		lock_sock(sk);
+		daddr = rs->rs_conn_addr;
+		dport = rs->rs_conn_port;
+		release_sock(sk);
+	}
+
+	/* racing with another thread binding seems ok here */
+	if (daddr == 0 || rs->rs_bound_addr == 0) {
+		ret = -ENOTCONN; /* XXX not a great errno */
+		goto out;
+	}
+
+	rm = rds_message_copy_from_user(msg->msg_iov, payload_len);
+	if (IS_ERR(rm)) {
+		ret = PTR_ERR(rm);
+		rm = NULL;
+		goto out;
+	}
+
+	rm->m_daddr = daddr;
+
+	/* Parse any control messages the user may have included. */
+	ret = rds_cmsg_send(rs, rm, msg, &allocated_mr);
+	if (ret)
+		goto out;
+
+	/* rds_conn_create has a spinlock that runs with IRQ off.
+	 * Caching the conn in the socket helps a lot. */
+	if (rs->rs_conn && rs->rs_conn->c_faddr == daddr)
+		conn = rs->rs_conn;
+	else {
+		conn = rds_conn_create_outgoing(rs->rs_bound_addr, daddr,
+					rs->rs_transport,
+					sock->sk->sk_allocation);
+		if (IS_ERR(conn)) {
+			ret = PTR_ERR(conn);
+			goto out;
+		}
+		rs->rs_conn = conn;
+	}
+
+	if ((rm->m_rdma_cookie || rm->m_rdma_op)
+	 && conn->c_trans->xmit_rdma == NULL) {
+		if (printk_ratelimit())
+			printk(KERN_NOTICE "rdma_op %p conn xmit_rdma %p\n",
+				rm->m_rdma_op, conn->c_trans->xmit_rdma);
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	/* If the connection is down, trigger a connect. We may
+	 * have scheduled a delayed reconnect however - in this case
+	 * we should not interfere.
+	 */
+	if (rds_conn_state(conn) == RDS_CONN_DOWN
+	 && !test_and_set_bit(RDS_RECONNECT_PENDING, &conn->c_flags))
+		queue_delayed_work(rds_wq, &conn->c_conn_w, 0);
+
+	ret = rds_cong_wait(conn->c_fcong, dport, nonblock, rs);
+	if (ret)
+		goto out;
+
+	while (!rds_send_queue_rm(rs, conn, rm, rs->rs_bound_port,
+				  dport, &queued)) {
+		rds_stats_inc(s_send_queue_full);
+		/* XXX make sure this is reasonable */
+		if (payload_len > rds_sk_sndbuf(rs)) {
+			ret = -EMSGSIZE;
+			goto out;
+		}
+		if (nonblock) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
+		timeo = wait_event_interruptible_timeout(*sk->sk_sleep,
+					rds_send_queue_rm(rs, conn, rm,
+							  rs->rs_bound_port,
+							  dport,
+							  &queued),
+					timeo);
+		rdsdebug("sendmsg woke queued %d timeo %ld\n", queued, timeo);
+		if (timeo > 0 || timeo == MAX_SCHEDULE_TIMEOUT)
+			continue;
+
+		ret = timeo;
+		if (ret == 0)
+			ret = -ETIMEDOUT;
+		goto out;
+	}
+
+	/*
+	 * By now we've committed to the send.  We reuse rds_send_worker()
+	 * to retry sends in the rds thread if the transport asks us to.
+	 */
+	rds_stats_inc(s_send_queued);
+
+	if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
+		rds_send_worker(&conn->c_send_w.work);
+
+	rds_message_put(rm);
+	return payload_len;
+
+out:
+	/* If the user included an RDMA_MAP cmsg, we allocated an MR on the
+	 * fly. If the sendmsg goes through, we keep the MR. If it fails with
+	 * EAGAIN or in any other way, we need to destroy the MR again. */
+	if (allocated_mr)
+		rds_rdma_unuse(rs, rds_rdma_cookie_key(rm->m_rdma_cookie), 1);
+
+	if (rm)
+		rds_message_put(rm);
+	return ret;
+}
+
+/*
+ * Reply to a ping packet.
+ */
+int
+rds_send_pong(struct rds_connection *conn, __be16 dport)
+{
+	struct rds_message *rm;
+	unsigned long flags;
+	int ret = 0;
+
+	rm = rds_message_alloc(0, GFP_ATOMIC);
+	if (rm == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	rm->m_daddr = conn->c_faddr;
+
+	/* If the connection is down, trigger a connect. We may
+	 * have scheduled a delayed reconnect however - in this case
+	 * we should not interfere.
+	 */
+	if (rds_conn_state(conn) == RDS_CONN_DOWN
+	 && !test_and_set_bit(RDS_RECONNECT_PENDING, &conn->c_flags))
+		queue_delayed_work(rds_wq, &conn->c_conn_w, 0);
+
+	ret = rds_cong_wait(conn->c_fcong, dport, 1, NULL);
+	if (ret)
+		goto out;
+
+	spin_lock_irqsave(&conn->c_lock, flags);
+	list_add_tail(&rm->m_conn_item, &conn->c_send_queue);
+	set_bit(RDS_MSG_ON_CONN, &rm->m_flags);
+	rds_message_addref(rm);
+	rm->m_inc.i_conn = conn;
+
+	rds_message_populate_header(&rm->m_inc.i_hdr, 0, dport,
+				    conn->c_next_tx_seq);
+	conn->c_next_tx_seq++;
+	spin_unlock_irqrestore(&conn->c_lock, flags);
+
+	rds_stats_inc(s_send_queued);
+	rds_stats_inc(s_send_pong);
+
+	queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+	rds_message_put(rm);
+	return 0;
+
+out:
+	if (rm)
+		rds_message_put(rm);
+	return ret;
+}
-- 
1.5.6.3



* [PATCH 11/21] RDS: recv.c
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (9 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 10/21] RDS: send.c Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [PATCH 12/21] RDS: RDMA support Andy Grover
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

Upon receiving a datagram from the transport, RDS parses the
headers and potentially queues an ACK.
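
From userspace this path services a plain recvmsg(); a minimal
consumer looks roughly like the following sketch (illustrative only,
error handling omitted):

	struct sockaddr_in from;
	char buf[1024];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	struct msghdr msg = {
		.msg_name    = &from,
		.msg_namelen = sizeof(from),
		.msg_iov     = &iov,
		.msg_iovlen  = 1,
	};
	ssize_t n = recvmsg(fd, &msg, 0);
	/* on success, 'from' holds the sender's address and RDS port */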

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/recv.c |  550 +++++++++++++++++++++++++++++++++++++
 1 files changed, 550 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/recv.c

diff --git a/drivers/infiniband/ulp/rds/recv.c b/drivers/infiniband/ulp/rds/recv.c
new file mode 100644
index 0000000..691f8cb
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/recv.c
@@ -0,0 +1,550 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <net/sock.h>
+#include <linux/in.h>
+
+#include "rds.h"
+#include "rdma.h"
+
+void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
+		  __be32 saddr)
+{
+	atomic_set(&inc->i_refcount, 1);
+	INIT_LIST_HEAD(&inc->i_item);
+	inc->i_conn = conn;
+	inc->i_saddr = saddr;
+	inc->i_rdma_cookie = 0;
+}
+EXPORT_SYMBOL_GPL(rds_inc_init);
+
+void rds_inc_addref(struct rds_incoming *inc)
+{
+	rdsdebug("addref inc %p ref %d\n", inc, atomic_read(&inc->i_refcount));
+	atomic_inc(&inc->i_refcount);
+}
+EXPORT_SYMBOL_GPL(rds_inc_addref);
+
+void rds_inc_put(struct rds_incoming *inc)
+{
+	rdsdebug("put inc %p ref %d\n", inc, atomic_read(&inc->i_refcount));
+	if (atomic_dec_and_test(&inc->i_refcount)) {
+		BUG_ON(!list_empty(&inc->i_item));
+
+		inc->i_conn->c_trans->inc_free(inc);
+	}
+}
+EXPORT_SYMBOL_GPL(rds_inc_put);
+
+static void rds_recv_rcvbuf_delta(struct rds_sock *rs, struct sock *sk,
+				  struct rds_cong_map *map,
+				  int delta, __be16 port)
+{
+	int now_congested;
+
+	if (delta == 0)
+		return;
+
+	rs->rs_rcv_bytes += delta;
+	now_congested = rs->rs_rcv_bytes > rds_sk_rcvbuf(rs);
+
+	rdsdebug("rs %p recv bytes %d buf %d now_cong %d\n",
+		 rs, rs->rs_rcv_bytes, rds_sk_rcvbuf(rs), now_congested);
+
+	/* wasn't -> am congested */
+	if (!rs->rs_congested && now_congested) {
+		rs->rs_congested = 1;
+		rds_cong_set_bit(map, port);
+		rds_cong_queue_updates(map);
+	}
+	/* was -> aren't congested */
+	/* Require more free space before reporting uncongested to prevent
+	 * bouncing cong/uncong state too often */
+	else if (rs->rs_congested && (rs->rs_rcv_bytes < (rds_sk_rcvbuf(rs)/2))) {
+		rs->rs_congested = 0;
+		rds_cong_clear_bit(map, port);
+		rds_cong_queue_updates(map);
+	}
+
+	/* do nothing if no change in cong state */
+}
+
+/*
+ * Process all extension headers that come with this message.
+ */
+static void rds_recv_incoming_exthdrs(struct rds_incoming *inc, struct rds_sock *rs)
+{
+	struct rds_header *hdr = &inc->i_hdr;
+	unsigned int pos = 0, type, len;
+	union {
+		struct rds_ext_header_version version;
+		struct rds_ext_header_rdma rdma;
+		struct rds_ext_header_rdma_dest rdma_dest;
+	} buffer;
+
+	while (1) {
+		len = sizeof(buffer);
+		type = rds_message_next_extension(hdr, &pos, &buffer, &len);
+		if (type == RDS_EXTHDR_NONE)
+			break;
+		/* Process extension header here */
+		switch (type) {
+		case RDS_EXTHDR_RDMA:
+			rds_rdma_unuse(rs, be32_to_cpu(buffer.rdma.h_rdma_rkey), 0);
+			break;
+
+		case RDS_EXTHDR_RDMA_DEST:
+			/* We ignore the size for now. We could stash it
+			 * somewhere and use it for error checking. */
+			inc->i_rdma_cookie = rds_rdma_make_cookie(
+					be32_to_cpu(buffer.rdma_dest.h_rdma_rkey),
+					be32_to_cpu(buffer.rdma_dest.h_rdma_offset));
+
+			break;
+		}
+	}
+}
+
+/*
+ * The transport must make sure that this is serialized against other
+ * rx and conn reset on this specific conn.
+ *
+ * We currently assert that only one fragmented message will be sent
+ * down a connection at a time.  This lets us reassemble in the conn
+ * instead of per-flow which means that we don't have to go digging through
+ * flows to tear down partial reassembly progress on conn failure and
+ * we save flow lookup and locking for each frag arrival.  It does mean
+ * that small messages will wait behind large ones.  Fragmenting at all
+ * is only to reduce the memory consumption of pre-posted buffers.
+ *
+ * The caller passes in saddr and daddr instead of us getting it from the
+ * conn.  This lets loopback, which only has one conn for both directions,
+ * tell us which roles the addrs in the conn are playing for this message.
+ */
+void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
+		       struct rds_incoming *inc, gfp_t gfp, enum km_type km)
+{
+	struct rds_sock *rs = NULL;
+	struct sock *sk;
+	unsigned long flags;
+
+	inc->i_conn = conn;
+	inc->i_rx_jiffies = jiffies;
+
+	rdsdebug("conn %p next %llu inc %p seq %llu len %u sport %u dport %u "
+		 "flags 0x%x rx_jiffies %lu\n", conn,
+		 (unsigned long long)conn->c_next_rx_seq,
+		 inc,
+		 (unsigned long long)be64_to_cpu(inc->i_hdr.h_sequence),
+		 be32_to_cpu(inc->i_hdr.h_len),
+		 be16_to_cpu(inc->i_hdr.h_sport),
+		 be16_to_cpu(inc->i_hdr.h_dport),
+		 inc->i_hdr.h_flags,
+		 inc->i_rx_jiffies);
+
+	/*
+	 * Sequence numbers should only increase.  Messages get their
+	 * sequence number as they're queued in a sending conn.  They
+	 * can be dropped, though, if the sending socket is closed before
+	 * they hit the wire.  So sequence numbers can skip forward
+	 * under normal operation.  They can also drop back in the conn
+	 * failover case as previously sent messages are resent down the
+	 * new instance of a conn.  We drop those, otherwise we have
+	 * to assume that the next valid seq does not come after a
+	 * hole in the fragment stream.
+	 *
+	 * The headers don't give us a way to realize if fragments of
+	 * a message have been dropped.  We assume that frags that arrive
+	 * to a flow are part of the current message on the flow that is
+	 * being reassembled.  This means that senders can't drop messages
+	 * from the sending conn until all their frags are sent.
+	 *
+	 * XXX we could spend more on the wire to get more robust failure
+	 * detection, arguably worth it to avoid data corruption.
+	 */
+	if (be64_to_cpu(inc->i_hdr.h_sequence) < conn->c_next_rx_seq
+	 && (inc->i_hdr.h_flags & RDS_FLAG_RETRANSMITTED)) {
+		rds_stats_inc(s_recv_drop_old_seq);
+		goto out;
+	}
+	conn->c_next_rx_seq = be64_to_cpu(inc->i_hdr.h_sequence) + 1;
+
+	if (rds_sysctl_ping_enable && inc->i_hdr.h_dport == 0) {
+		rds_stats_inc(s_recv_ping);
+		rds_send_pong(conn, inc->i_hdr.h_sport);
+		goto out;
+	}
+
+	rs = rds_find_bound(daddr, inc->i_hdr.h_dport);
+	if (rs == NULL) {
+		rds_stats_inc(s_recv_drop_no_sock);
+		goto out;
+	}
+
+	/* Process extension headers */
+	rds_recv_incoming_exthdrs(inc, rs);
+
+	/* We can be racing with rds_release() which marks the socket dead. */
+	sk = rds_rs_to_sk(rs);
+
+	/* serialize with rds_release -> sock_orphan */
+	write_lock_irqsave(&rs->rs_recv_lock, flags);
+	if (!sock_flag(sk, SOCK_DEAD)) {
+		rdsdebug("adding inc %p to rs %p's recv queue\n", inc, rs);
+		rds_stats_inc(s_recv_queued);
+		rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
+				      be32_to_cpu(inc->i_hdr.h_len),
+				      inc->i_hdr.h_dport);
+		rds_inc_addref(inc);
+		list_add_tail(&inc->i_item, &rs->rs_recv_queue);
+		__rds_wake_sk_sleep(sk);
+	} else {
+		rds_stats_inc(s_recv_drop_dead_sock);
+	}
+	write_unlock_irqrestore(&rs->rs_recv_lock, flags);
+
+out:
+	if (rs)
+		rds_sock_put(rs);
+}
+EXPORT_SYMBOL_GPL(rds_recv_incoming);
+
+/*
+ * be very careful here.  This is called as the condition in
+ * wait_event_*() and needs to cope with being called many times.
+ */
+static int rds_next_incoming(struct rds_sock *rs, struct rds_incoming **inc)
+{
+	unsigned long flags;
+
+	if (*inc == NULL) {
+		read_lock_irqsave(&rs->rs_recv_lock, flags);
+		if (!list_empty(&rs->rs_recv_queue)) {
+			*inc = list_entry(rs->rs_recv_queue.next,
+					  struct rds_incoming,
+					  i_item);
+			rds_inc_addref(*inc);
+		}
+		read_unlock_irqrestore(&rs->rs_recv_lock, flags);
+	}
+
+	return *inc != NULL;
+}
+
+static int rds_still_queued(struct rds_sock *rs, struct rds_incoming *inc,
+			    int drop)
+{
+	struct sock *sk = rds_rs_to_sk(rs);
+	int ret = 0;
+	unsigned long flags;
+
+	write_lock_irqsave(&rs->rs_recv_lock, flags);
+	if (!list_empty(&inc->i_item)) {
+		ret = 1;
+		if (drop) {
+			/* XXX make sure this i_conn is reliable */
+			rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
+					      -be32_to_cpu(inc->i_hdr.h_len),
+					      inc->i_hdr.h_dport);
+			list_del_init(&inc->i_item);
+			rds_inc_put(inc);
+		}
+	}
+	write_unlock_irqrestore(&rs->rs_recv_lock, flags);
+
+	rdsdebug("inc %p rs %p still %d dropped %d\n", inc, rs, ret, drop);
+	return ret;
+}
+
+/*
+ * Pull errors off the error queue.
+ * If msghdr is NULL, we will just purge the error queue.
+ */
+int rds_notify_queue_get(struct rds_sock *rs, struct msghdr *msghdr)
+{
+	struct rds_notifier *notifier;
+	struct rds_rdma_notify cmsg;
+	unsigned int count = 0, max_messages = ~0U;
+	unsigned long flags;
+	LIST_HEAD(copy);
+	int err = 0;
+
+	/* put_cmsg copies to user space and thus may sleep. We can't do this
+	 * with rs_lock held, so first grab as many notifications as we can stuff
+	 * in the user provided cmsg buffer. We don't try to copy more, to avoid
+	 * losing notifications - except when the buffer is so small that it wouldn't
+	 * even hold a single notification. Then we give him as much of this single
+	 * msg as we can squeeze in, and set MSG_CTRUNC.
+	 */
+	if (msghdr) {
+		max_messages = msghdr->msg_controllen / CMSG_SPACE(sizeof(cmsg));
+		if (!max_messages)
+			max_messages = 1;
+	}
+
+	spin_lock_irqsave(&rs->rs_lock, flags);
+	while (!list_empty(&rs->rs_notify_queue) && count < max_messages) {
+		notifier = list_entry(rs->rs_notify_queue.next,
+				struct rds_notifier, n_list);
+		list_move(&notifier->n_list, &copy);
+		count++;
+	}
+	spin_unlock_irqrestore(&rs->rs_lock, flags);
+
+	if (!count)
+		return 0;
+
+	while (!list_empty(&copy)) {
+		notifier = list_entry(copy.next, struct rds_notifier, n_list);
+
+		if (msghdr) {
+			cmsg.user_token = notifier->n_user_token;
+			cmsg.status  = notifier->n_status;
+
+			err = put_cmsg(msghdr, SOL_RDS, RDS_CMSG_RDMA_STATUS,
+					sizeof(cmsg), &cmsg);
+			if (err)
+				break;
+		}
+
+		list_del_init(&notifier->n_list);
+		kfree(notifier);
+	}
+
+	/* If we bailed out because of an error in put_cmsg,
+	 * we may be left with one or more notifications that we
+	 * didn't process. Return them to the head of the list. */
+	if (!list_empty(&copy)) {
+		spin_lock_irqsave(&rs->rs_lock, flags);
+		list_splice(&copy, &rs->rs_notify_queue);
+		spin_unlock_irqrestore(&rs->rs_lock, flags);
+	}
+
+	return err;
+}
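+
+/*
+ * Userspace pairs with the above roughly as follows (illustrative
+ * sketch; SOL_RDS and struct rds_rdma_notify come from rds_rdma.h,
+ * error handling omitted):
+ *
+ *	char ctl[1024];
+ *	struct msghdr msg = { .msg_control = ctl,
+ *			      .msg_controllen = sizeof(ctl) };
+ *	struct cmsghdr *cmsg;
+ *	struct rds_rdma_notify note;
+ *
+ *	recvmsg(fd, &msg, 0);
+ *	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
+ *	     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+ *		if (cmsg->cmsg_level == SOL_RDS &&
+ *		    cmsg->cmsg_type == RDS_CMSG_RDMA_STATUS)
+ *			memcpy(&note, CMSG_DATA(cmsg), sizeof(note));
+ *	}
+ */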
+
+/*
+ * Queue a congestion notification
+ */
+static int rds_notify_cong(struct rds_sock *rs, struct msghdr *msghdr)
+{
+	uint64_t notify = rs->rs_cong_notify;
+	unsigned long flags;
+	int err;
+
+	err = put_cmsg(msghdr, SOL_RDS, RDS_CMSG_CONG_UPDATE,
+			sizeof(notify), &notify);
+	if (err)
+		return err;
+
+	spin_lock_irqsave(&rs->rs_lock, flags);
+	rs->rs_cong_notify &= ~notify;
+	spin_unlock_irqrestore(&rs->rs_lock, flags);
+
+	return 0;
+}
+
+/*
+ * Receive any control messages.
+ */
+static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg)
+{
+	int ret = 0;
+
+	if (inc->i_rdma_cookie) {
+		ret = put_cmsg(msg, SOL_RDS, RDS_CMSG_RDMA_DEST,
+				sizeof(inc->i_rdma_cookie), &inc->i_rdma_cookie);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		size_t size, int msg_flags)
+{
+	struct sock *sk = sock->sk;
+	struct rds_sock *rs = rds_sk_to_rs(sk);
+	long timeo;
+	int ret = 0, nonblock = msg_flags & MSG_DONTWAIT;
+	struct sockaddr_in *sin;
+	struct rds_incoming *inc = NULL;
+
+	/* udp_recvmsg()->sock_recvtimeo() gets away without locking too.. */
+	timeo = sock_rcvtimeo(sk, nonblock);
+
+	rdsdebug("size %zu flags 0x%x timeo %ld\n", size, msg_flags, timeo);
+
+	if (msg_flags & MSG_OOB)
+		goto out;
+
+	/* If there are pending notifications, do those - and nothing else */
+	if (!list_empty(&rs->rs_notify_queue)) {
+		ret = rds_notify_queue_get(rs, msg);
+		goto out;
+	}
+
+	if (rs->rs_cong_notify) {
+		ret = rds_notify_cong(rs, msg);
+		goto out;
+	}
+
+	while (1) {
+		if (!rds_next_incoming(rs, &inc)) {
+			if (nonblock) {
+				ret = -EAGAIN;
+				break;
+			}
+
+			timeo = wait_event_interruptible_timeout(*sk->sk_sleep,
+						rds_next_incoming(rs, &inc),
+						timeo);
+			rdsdebug("recvmsg woke inc %p timeo %ld\n", inc,
+				 timeo);
+			if (timeo > 0 || timeo == MAX_SCHEDULE_TIMEOUT)
+				continue;
+
+			ret = timeo;
+			if (ret == 0)
+				ret = -ETIMEDOUT;
+			break;
+		}
+
+		rdsdebug("copying inc %p from %u.%u.%u.%u:%u to user\n", inc,
+			 NIPQUAD(inc->i_conn->c_faddr),
+			 ntohs(inc->i_hdr.h_sport));
+		ret = inc->i_conn->c_trans->inc_copy_to_user(inc, msg->msg_iov,
+							     size);
+		if (ret < 0)
+			break;
+
+		/*
+		 * if the message we just copied isn't at the head of the
+		 * recv queue then someone else raced us to return it, try
+		 * to get the next message.
+		 */
+		if (!rds_still_queued(rs, inc, !(msg_flags & MSG_PEEK))) {
+			rds_inc_put(inc);
+			inc = NULL;
+			rds_stats_inc(s_recv_deliver_raced);
+			continue;
+		}
+
+		if (ret < be32_to_cpu(inc->i_hdr.h_len)) {
+			if (msg_flags & MSG_TRUNC)
+				ret = be32_to_cpu(inc->i_hdr.h_len);
+			msg->msg_flags |= MSG_TRUNC;
+		}
+
+		if (rds_cmsg_recv(inc, msg)) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		rds_stats_inc(s_recv_delivered);
+
+		sin = (struct sockaddr_in *)msg->msg_name;
+		if (sin) {
+			sin->sin_family = AF_INET;
+			sin->sin_port = inc->i_hdr.h_sport;
+			sin->sin_addr.s_addr = inc->i_saddr;
+			memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+		}
+		break;
+	}
+
+	if (inc)
+		rds_inc_put(inc);
+
+out:
+	return ret;
+}
+
+/*
+ * The socket is being shut down and we're asked to drop messages that were
+ * queued for recvmsg.  The caller has unbound the socket so the receive path
+ * won't queue any more incoming fragments or messages on the socket.
+ */
+void rds_clear_recv_queue(struct rds_sock *rs)
+{
+	struct sock *sk = rds_rs_to_sk(rs);
+	struct rds_incoming *inc, *tmp;
+	unsigned long flags;
+
+	write_lock_irqsave(&rs->rs_recv_lock, flags);
+	list_for_each_entry_safe(inc, tmp, &rs->rs_recv_queue, i_item) {
+		rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
+				      -be32_to_cpu(inc->i_hdr.h_len),
+				      inc->i_hdr.h_dport);
+		list_del_init(&inc->i_item);
+		rds_inc_put(inc);
+	}
+	write_unlock_irqrestore(&rs->rs_recv_lock, flags);
+}
+
+/*
+ * inc->i_saddr isn't used here because it is only set in the receive
+ * path.
+ */
+void rds_inc_info_copy(struct rds_incoming *inc,
+		       struct rds_info_iterator *iter,
+		       __be32 saddr, __be32 daddr, int flip)
+{
+	struct rds_info_message minfo;
+
+	minfo.seq = be64_to_cpu(inc->i_hdr.h_sequence);
+	minfo.len = be32_to_cpu(inc->i_hdr.h_len);
+
+	if (flip) {
+		minfo.laddr = daddr;
+		minfo.faddr = saddr;
+		minfo.lport = inc->i_hdr.h_dport;
+		minfo.fport = inc->i_hdr.h_sport;
+	} else {
+		minfo.laddr = saddr;
+		minfo.faddr = daddr;
+		minfo.lport = inc->i_hdr.h_sport;
+		minfo.fport = inc->i_hdr.h_dport;
+	}
+
+	rds_info_copy(iter, &minfo, sizeof(minfo));
+}
-- 
1.5.6.3



* [PATCH 12/21] RDS: RDMA support
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (10 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 11/21] RDS: recv.c Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [ofa-general] [PATCH 13/21] RDS/IB: Infiniband transport Andy Grover
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

Some transports may support RDMA features. This handles the
non-transport-specific parts, like pinning user pages and
tracking mapped regions.
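
For orientation, userspace drives the MR registration added here via
setsockopt(); a hedged sketch (names are from rds_rdma.h below,
SOL_RDS is defined elsewhere in the series, error handling omitted):

	uint64_t cookie;
	struct rds_get_mr_args args = {
		.vec	     = { .addr	= (uint64_t) (unsigned long) buf,
				 .bytes = len },
		.cookie_addr = (uint64_t) (unsigned long) &cookie,
		.flags	     = RDS_RDMA_USE_ONCE,
	};

	setsockopt(fd, SOL_RDS, RDS_GET_MR, &args, sizeof(args));
	/* 'cookie' now encodes <R_Key, offset> and can be handed to the
	 * peer via the RDS_CMSG_RDMA_DEST control message */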

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/rdma.c     |  682 +++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/rds/rdma.h     |   84 ++++
 drivers/infiniband/ulp/rds/rds_rdma.h |  245 ++++++++++++
 3 files changed, 1011 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/rdma.c
 create mode 100644 drivers/infiniband/ulp/rds/rdma.h
 create mode 100644 drivers/infiniband/ulp/rds/rds_rdma.h

diff --git a/drivers/infiniband/ulp/rds/rdma.c b/drivers/infiniband/ulp/rds/rdma.c
new file mode 100644
index 0000000..00e3450
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/rdma.c
@@ -0,0 +1,682 @@
+/*
+ * Copyright (c) 2007 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/pagemap.h>
+#include <linux/rbtree.h>
+#include <linux/dma-mapping.h> /* for DMA_*_DEVICE */
+
+#include "rdma.h"
+
+/*
+ * XXX
+ *  - build with sparse
+ *  - should we limit the size of a mr region?  let transport return failure?
+ *  - should we detect duplicate keys on a socket?  hmm.
+ *  - an rdma is an mlock, apply rlimit?
+ */
+
+/*
+ * get the number of pages by looking at the page indices that the start and
+ * end addresses fall in.
+ *
+ * Returns 0 if the vec is invalid.  It is invalid if the number of bytes
+ * causes the address to wrap or overflows an unsigned int.  This comes
+ * from being stored in the 'length' member of 'struct scatterlist'.
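+ *
+ * For example, addr 0x1001 with bytes 0x1000 straddles page indices 1
+ * and 2 (with 4K pages), so this returns 2 even though bytes is exactly
+ * one page.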
+ */
+static unsigned int rds_pages_in_vec(struct rds_iovec *vec)
+{
+	if ((vec->addr + vec->bytes <= vec->addr) ||
+	    (vec->bytes > (u64)UINT_MAX))
+		return 0;
+
+	return ((vec->addr + vec->bytes + PAGE_SIZE - 1) >> PAGE_SHIFT) -
+		(vec->addr >> PAGE_SHIFT);
+}
+
+static struct rds_mr *rds_mr_tree_walk(struct rb_root *root, u64 key,
+				       struct rds_mr *insert)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct rds_mr *mr;
+
+	while (*p) {
+		parent = *p;
+		mr = rb_entry(parent, struct rds_mr, r_rb_node);
+
+		if (key < mr->r_key)
+			p = &(*p)->rb_left;
+		else if (key > mr->r_key)
+			p = &(*p)->rb_right;
+		else
+			return mr;
+	}
+
+	if (insert) {
+		rb_link_node(&insert->r_rb_node, parent, p);
+		rb_insert_color(&insert->r_rb_node, root);
+		atomic_inc(&insert->r_refcount);
+	}
+	return NULL;
+}
+
+/*
+ * Destroy the transport-specific part of a MR.
+ */
+static void rds_destroy_mr(struct rds_mr *mr)
+{
+	struct rds_sock *rs = mr->r_sock;
+	void *trans_private = NULL;
+	unsigned long flags;
+
+	rdsdebug("RDS: destroy mr key is %x refcnt %u\n",
+			mr->r_key, atomic_read(&mr->r_refcount));
+
+	if (test_and_set_bit(RDS_MR_DEAD, &mr->r_state))
+		return;
+
+	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
+	if (!RB_EMPTY_NODE(&mr->r_rb_node))
+		rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys);
+	trans_private = mr->r_trans_private;
+	mr->r_trans_private = NULL;
+	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
+
+	if (trans_private)
+		mr->r_trans->free_mr(trans_private, mr->r_invalidate);
+}
+
+void __rds_put_mr_final(struct rds_mr *mr)
+{
+	rds_destroy_mr(mr);
+	kfree(mr);
+}
+
+/*
+ * By the time this is called we can't have any more ioctls called on
+ * the socket so we don't need to worry about racing with others.
+ */
+void rds_rdma_drop_keys(struct rds_sock *rs)
+{
+	struct rds_mr *mr;
+	struct rb_node *node;
+
+	/* Release any MRs associated with this socket */
+	while ((node = rb_first(&rs->rs_rdma_keys))) {
+		mr = container_of(node, struct rds_mr, r_rb_node);
+		if (mr->r_trans == rs->rs_transport)
+			mr->r_invalidate = 0;
+		rds_mr_put(mr);
+	}
+
+	if (rs->rs_transport && rs->rs_transport->flush_mrs)
+		rs->rs_transport->flush_mrs();
+}
+
+/*
+ * Helper function to pin user pages.
+ */
+static int rds_pin_pages(unsigned long user_addr, unsigned int nr_pages,
+			struct page **pages, int write)
+{
+	int ret;
+
+	down_read(&current->mm->mmap_sem);
+	ret = get_user_pages(current, current->mm, user_addr,
+			     nr_pages, write, 0, pages, NULL);
+	up_read(&current->mm->mmap_sem);
+
+	if (ret >= 0 && (unsigned int) ret < nr_pages) {
+		while (ret--)
+			put_page(pages[ret]);
+		ret = -EFAULT;
+	}
+
+	return ret;
+}
+
+static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args,
+				u64 *cookie_ret, struct rds_mr **mr_ret)
+{
+	struct rds_mr *mr = NULL, *found;
+	unsigned int nr_pages;
+	struct page **pages = NULL;
+	struct scatterlist *sg;
+	void *trans_private;
+	unsigned long flags;
+	rds_rdma_cookie_t cookie;
+	unsigned int nents;
+	long i;
+	int ret;
+
+	if (rs->rs_bound_addr == 0) {
+		ret = -ENOTCONN; /* XXX not a great errno */
+		goto out;
+	}
+
+	if (rs->rs_transport->get_mr == NULL) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	nr_pages = rds_pages_in_vec(&args->vec);
+	if (nr_pages == 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	rdsdebug("RDS: get_mr addr %llx len %llu nr_pages %u\n",
+		args->vec.addr, args->vec.bytes, nr_pages);
+
+	/* XXX clamp nr_pages to limit the size of this alloc? */
+	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
+	if (pages == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	mr = kzalloc(sizeof(struct rds_mr), GFP_KERNEL);
+	if (mr == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	atomic_set(&mr->r_refcount, 1);
+	RB_CLEAR_NODE(&mr->r_rb_node);
+	mr->r_trans = rs->rs_transport;
+	mr->r_sock = rs;
+
+	if (args->flags & RDS_RDMA_USE_ONCE)
+		mr->r_use_once = 1;
+	if (args->flags & RDS_RDMA_INVALIDATE)
+		mr->r_invalidate = 1;
+	if (args->flags & RDS_RDMA_READWRITE)
+		mr->r_write = 1;
+
+	/*
+	 * Pin the pages that make up the user buffer and transfer the page
+	 * pointers to the mr's sg array.  We check to see if we've mapped
+	 * the whole region after transferring the partial page references
+	 * to the sg array so that we can have one page ref cleanup path.
+	 *
+	 * For now we have no flag that tells us whether the mapping is
+	 * r/o or r/w. We need to assume r/w, or we'll do a lot of RDMA to
+	 * the zero page.
+	 */
+	ret = rds_pin_pages(args->vec.addr & PAGE_MASK, nr_pages, pages, 1);
+	if (ret < 0)
+		goto out;
+
+	nents = ret;
+	sg = kcalloc(nents, sizeof(*sg), GFP_KERNEL);
+	if (sg == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Stick all pages into the scatterlist */
+	for (i = 0; i < nents; i++) {
+#ifdef CONFIG_DEBUG_SG
+		sg[i].sg_magic = SG_MAGIC;
+#endif
+		sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);
+	}
+
+	rdsdebug("RDS: trans_private nents is %u\n", nents);
+
+	/* Obtain a transport specific MR. If this succeeds, the
+	 * s/g list is now owned by the MR.
+	 * Note that dma_map() implies that pending writes are
+	 * flushed to RAM, so no dma_sync is needed here. */
+	trans_private = rs->rs_transport->get_mr(sg, nents, rs,
+						 &mr->r_key);
+
+	if (IS_ERR(trans_private)) {
+		for (i = 0; i < nents; i++)
+			put_page(sg_page(&sg[i]));
+		kfree(sg);
+		ret = PTR_ERR(trans_private);
+		goto out;
+	}
+
+	mr->r_trans_private = trans_private;
+
+	rdsdebug("RDS: get_mr put_user key is %x cookie_addr %p\n",
+	       mr->r_key, (void *)(unsigned long) args->cookie_addr);
+
+	/* The user may pass us an unaligned address, but we can only
+	 * map page aligned regions. So we keep the offset, and build
+	 * a 64bit cookie containing <R_Key, offset> and pass that
+	 * around. */
+	cookie = rds_rdma_make_cookie(mr->r_key, args->vec.addr & ~PAGE_MASK);
+	if (cookie_ret)
+		*cookie_ret = cookie;
+
+	if (args->cookie_addr && put_user(cookie, (u64 __user *)(unsigned long) args->cookie_addr)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* Inserting the new MR into the rbtree bumps its
+	 * reference count. */
+	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
+	found = rds_mr_tree_walk(&rs->rs_rdma_keys, mr->r_key, mr);
+	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
+
+	BUG_ON(found && found != mr);
+
+	rdsdebug("RDS: get_mr key is %x\n", mr->r_key);
+	if (mr_ret) {
+		atomic_inc(&mr->r_refcount);
+		*mr_ret = mr;
+	}
+
+	ret = 0;
+out:
+	kfree(pages);
+	if (mr)
+		rds_mr_put(mr);
+	return ret;
+}
+
+int rds_get_mr(struct rds_sock *rs, char __user *optval, int optlen)
+{
+	struct rds_get_mr_args args;
+
+	if (optlen != sizeof(struct rds_get_mr_args))
+		return -EINVAL;
+
+	if (copy_from_user(&args, (struct rds_get_mr_args __user *)optval,
+			   sizeof(struct rds_get_mr_args)))
+		return -EFAULT;
+
+	return __rds_rdma_map(rs, &args, NULL, NULL);
+}
+
+/*
+ * Free the MR indicated by the given R_Key
+ */
+int rds_free_mr(struct rds_sock *rs, char __user *optval, int optlen)
+{
+	struct rds_free_mr_args args;
+	struct rds_mr *mr;
+	unsigned long flags;
+
+	if (optlen != sizeof(struct rds_free_mr_args))
+		return -EINVAL;
+
+	if (copy_from_user(&args, (struct rds_free_mr_args __user *)optval,
+			   sizeof(struct rds_free_mr_args)))
+		return -EFAULT;
+
+	/* Special case - a null cookie means flush all unused MRs */
+	if (args.cookie == 0) {
+		if (!rs->rs_transport || !rs->rs_transport->flush_mrs)
+			return -EINVAL;
+		rs->rs_transport->flush_mrs();
+		return 0;
+	}
+
+	/* Look up the MR given its R_key and remove it from the rbtree
+	 * so nobody else finds it.
+	 * This should also prevent races with rds_rdma_unuse.
+	 */
+	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
+	mr = rds_mr_tree_walk(&rs->rs_rdma_keys, rds_rdma_cookie_key(args.cookie), NULL);
+	if (mr) {
+		rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys);
+		RB_CLEAR_NODE(&mr->r_rb_node);
+		if (args.flags & RDS_RDMA_INVALIDATE)
+			mr->r_invalidate = 1;
+	}
+	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
+
+	if (!mr)
+		return -EINVAL;
+
+	/*
+	 * call rds_destroy_mr() ourselves so that we're sure it's done by the time
+	 * we return.  If we let rds_mr_put() do it, it might not happen until
+	 * someone else drops their ref.
+	 */
+	rds_destroy_mr(mr);
+	rds_mr_put(mr);
+	return 0;
+}
+
+/*
+ * This is called when we receive an extension header that
+ * tells us this MR was used. It allows us to implement
+ * use_once semantics
+ */
+void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force)
+{
+	struct rds_mr *mr;
+	unsigned long flags;
+	int zot_me = 0;
+
+	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
+	mr = rds_mr_tree_walk(&rs->rs_rdma_keys, r_key, NULL);
+	if (mr && (mr->r_use_once || force)) {
+		rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys);
+		RB_CLEAR_NODE(&mr->r_rb_node);
+		zot_me = 1;
+	} else if (mr)
+		atomic_inc(&mr->r_refcount);
+	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
+
+	/* May have to issue a dma_sync on this memory region.
+	 * Note we could avoid this if the operation was a RDMA READ,
+	 * but at this point we can't tell. */
+	if (mr != NULL) {
+		if (mr->r_trans->sync_mr)
+			mr->r_trans->sync_mr(mr->r_trans_private, DMA_FROM_DEVICE);
+
+		/* If the MR was marked as invalidate, this will
+		 * trigger an async flush. */
+		if (zot_me)
+			rds_destroy_mr(mr);
+		rds_mr_put(mr);
+	}
+}
+
+void rds_rdma_free_op(struct rds_rdma_op *ro)
+{
+	unsigned int i;
+
+	for (i = 0; i < ro->r_nents; i++) {
+		struct page *page = sg_page(&ro->r_sg[i]);
+
+		/* Mark page dirty if it was possibly modified, which
+		 * is the case for an RDMA_READ, which copies from remote
+		 * to local memory */
+		if (!ro->r_write)
+			set_page_dirty(page);
+		put_page(page);
+	}
+
+	kfree(ro->r_notifier);
+	kfree(ro);
+}
+
+/*
+ * args is a pointer to an in-kernel copy in the sendmsg cmsg.
+ */
+static struct rds_rdma_op *rds_rdma_prepare(struct rds_sock *rs,
+					    struct rds_rdma_args *args)
+{
+	struct rds_iovec vec;
+	struct rds_rdma_op *op = NULL;
+	unsigned int nr_pages;
+	unsigned int max_pages;
+	unsigned int nr_bytes;
+	struct page **pages = NULL;
+	struct rds_iovec __user *local_vec;
+	struct scatterlist *sg;
+	unsigned int nr;
+	unsigned int i, j;
+	int ret;
+
+	if (rs->rs_bound_addr == 0) {
+		ret = -ENOTCONN; /* XXX not a great errno */
+		goto out;
+	}
+
+	if (args->nr_local > (u64)UINT_MAX) {
+		ret = -EMSGSIZE;
+		goto out;
+	}
+
+	nr_pages = 0;
+	max_pages = 0;
+
+	local_vec = (struct rds_iovec __user *)(unsigned long) args->local_vec_addr;
+
+	/* figure out the number of pages in the vector */
+	for (i = 0; i < args->nr_local; i++) {
+		if (copy_from_user(&vec, &local_vec[i],
+				   sizeof(struct rds_iovec))) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		nr = rds_pages_in_vec(&vec);
+		if (nr == 0) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		max_pages = max(nr, max_pages);
+		nr_pages += nr;
+	}
+
+	pages = kcalloc(max_pages, sizeof(struct page *), GFP_KERNEL);
+	if (pages == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	op = kzalloc(offsetof(struct rds_rdma_op, r_sg[nr_pages]), GFP_KERNEL);
+	if (op == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	op->r_write = !!(args->flags & RDS_RDMA_READWRITE);
+	op->r_fence = !!(args->flags & RDS_RDMA_FENCE);
+	op->r_notify = !!(args->flags & RDS_RDMA_NOTIFY_ME);
+	op->r_recverr = rs->rs_recverr;
+
+	if (op->r_notify || op->r_recverr) {
+		/* We allocate an uninitialized notifier here, because
+		 * we don't want to do that in the completion handler. We
+		 * would have to use GFP_ATOMIC there, and don't want to deal
+		 * with failed allocations.
+		 */
+		op->r_notifier = kmalloc(sizeof(struct rds_notifier), GFP_KERNEL);
+		if (!op->r_notifier) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		op->r_notifier->n_user_token = args->user_token;
+		op->r_notifier->n_status = RDS_RDMA_SUCCESS;
+	}
+
+	/* The cookie contains the R_Key of the remote memory region, and
+	 * optionally an offset into it. This is how we implement RDMA into
+	 * unaligned memory.
+	 * When setting up the RDMA, we need to add that offset to the
+	 * destination address (which is really an offset into the MR)
+	 * FIXME: We may want to move this into ib_rdma.c
+	 */
+	op->r_key = rds_rdma_cookie_key(args->cookie);
+	op->r_remote_addr = args->remote_vec.addr + rds_rdma_cookie_offset(args->cookie);
+
+	nr_bytes = 0;
+
+	rdsdebug("RDS: rdma prepare nr_local %llu rva %llx rkey %x\n",
+	       (unsigned long long)args->nr_local,
+	       (unsigned long long)args->remote_vec.addr,
+	       op->r_key);
+
+	for (i = 0; i < args->nr_local; i++) {
+		if (copy_from_user(&vec, &local_vec[i],
+				   sizeof(struct rds_iovec))) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		nr = rds_pages_in_vec(&vec);
+		if (nr == 0) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		rs->rs_user_addr = vec.addr;
+		rs->rs_user_bytes = vec.bytes;
+
+		/* did the user change the vec under us? */
+		if (nr > max_pages || op->r_nents + nr > nr_pages) {
+			ret = -EINVAL;
+			goto out;
+		}
+		/* If it's a WRITE operation, we want to pin the pages for reading.
+		 * If it's a READ operation, we need to pin the pages for writing.
+		 */
+		ret = rds_pin_pages(vec.addr & PAGE_MASK, nr, pages, !op->r_write);
+		if (ret < 0)
+			goto out;
+
+		rdsdebug("RDS: nr_bytes %u nr %u vec.bytes %llu vec.addr %llx\n",
+		       nr_bytes, nr, vec.bytes, vec.addr);
+
+		nr_bytes += vec.bytes;
+
+		for (j = 0; j < nr; j++) {
+			unsigned int offset = vec.addr & ~PAGE_MASK;
+
+			sg = &op->r_sg[op->r_nents + j];
+#ifdef CONFIG_DEBUG_SG
+			sg->sg_magic = SG_MAGIC;
+#endif
+			sg_set_page(sg, pages[j],
+					min_t(unsigned int, vec.bytes, PAGE_SIZE - offset),
+					offset);
+
+			rdsdebug("RDS: sg->offset %x sg->len %x vec.addr %llx vec.bytes %llu\n",
+			       sg->offset, sg->length, vec.addr, vec.bytes);
+
+			vec.addr += sg->length;
+			vec.bytes -= sg->length;
+		}
+
+		op->r_nents += nr;
+	}
+
+	if (nr_bytes > args->remote_vec.bytes) {
+		rdsdebug("RDS: nr_bytes %u remote_bytes %u do not match\n",
+				nr_bytes,
+				(unsigned int) args->remote_vec.bytes);
+		ret = -EINVAL;
+		goto out;
+	}
+	op->r_bytes = nr_bytes;
+
+	ret = 0;
+out:
+	kfree(pages);
+	if (ret) {
+		if (op)
+			rds_rdma_free_op(op);
+		op = ERR_PTR(ret);
+	}
+	return op;
+}
+
+/*
+ * The application asks for an RDMA transfer.
+ * Extract all arguments and set up the rdma_op
+ */
+int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg)
+{
+	struct rds_rdma_op *op;
+
+	if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct rds_rdma_args))
+	 || rm->m_rdma_op != NULL)
+		return -EINVAL;
+
+	op = rds_rdma_prepare(rs, CMSG_DATA(cmsg));
+	if (IS_ERR(op))
+		return PTR_ERR(op);
+	rds_stats_inc(s_send_rdma);
+	rm->m_rdma_op = op;
+	return 0;
+}
+
+/*
+ * The application wants us to pass an RDMA destination (aka MR)
+ * to the remote peer.
+ */
+int rds_cmsg_rdma_dest(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg)
+{
+	unsigned long flags;
+	struct rds_mr *mr;
+	u32 r_key;
+	int err = 0;
+
+	if (cmsg->cmsg_len < CMSG_LEN(sizeof(rds_rdma_cookie_t))
+	 || rm->m_rdma_cookie != 0)
+		return -EINVAL;
+
+	memcpy(&rm->m_rdma_cookie, CMSG_DATA(cmsg), sizeof(rm->m_rdma_cookie));
+
+	/* We are reusing a previously mapped MR here. Most likely, the
+	 * application has written to the buffer, so we need to explicitly
+	 * flush those writes to RAM. Otherwise the HCA may not see them
+	 * when doing a DMA from that buffer.
+	 */
+	r_key = rds_rdma_cookie_key(rm->m_rdma_cookie);
+
+	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
+	mr = rds_mr_tree_walk(&rs->rs_rdma_keys, r_key, NULL);
+	if (mr == NULL)
+		err = -EINVAL;	/* invalid r_key */
+	else
+		atomic_inc(&mr->r_refcount);
+	spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
+
+	if (mr) {
+		mr->r_trans->sync_mr(mr->r_trans_private, DMA_TO_DEVICE);
+		rm->m_rdma_mr = mr;
+	}
+	return err;
+}
+
+/*
+ * The application passes us an address range it wants to enable RDMA
+ * to/from. We map the area, and save the <R_Key,offset> pair
+ * in rm->m_rdma_cookie. This causes it to be sent along to the peer
+ * in an extension header.
+ */
+int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg)
+{
+	if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct rds_get_mr_args))
+	 || rm->m_rdma_cookie != 0)
+		return -EINVAL;
+
+	return __rds_rdma_map(rs, CMSG_DATA(cmsg), &rm->m_rdma_cookie, &rm->m_rdma_mr);
+}
diff --git a/drivers/infiniband/ulp/rds/rdma.h b/drivers/infiniband/ulp/rds/rdma.h
new file mode 100644
index 0000000..4255120
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/rdma.h
@@ -0,0 +1,84 @@
+#ifndef _RDS_RDMA_H
+#define _RDS_RDMA_H
+
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/scatterlist.h>
+
+#include "rds.h"
+
+struct rds_mr {
+	struct rb_node		r_rb_node;
+	atomic_t		r_refcount;
+	u32			r_key;
+
+	/* A copy of the creation flags */
+	unsigned int		r_use_once:1;
+	unsigned int		r_invalidate:1;
+	unsigned int		r_write:1;
+
+	/* This is for RDS_MR_DEAD.
+	 * It would be nice & consistent to make this part of the above
+	 * bit field here, but we need to use test_and_set_bit.
+	 */
+	unsigned long		r_state;
+	struct rds_sock		*r_sock; /* back pointer to the socket that owns us */
+	struct rds_transport	*r_trans;
+	void			*r_trans_private;
+};
+
+/* Flags for mr->r_state */
+#define RDS_MR_DEAD		0
+
+struct rds_rdma_op {
+	u32			r_key;
+	u64			r_remote_addr;
+	unsigned int		r_write:1;
+	unsigned int		r_fence:1;
+	unsigned int		r_notify:1;
+	unsigned int		r_recverr:1;
+	unsigned int		r_mapped:1;
+	struct rds_notifier	*r_notifier;
+	unsigned int		r_bytes;
+	unsigned int		r_nents;
+	unsigned int		r_count;
+	struct scatterlist	r_sg[0];
+};
+
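+/*
+ * A cookie packs <R_Key, offset> into 64 bits: the R_Key in the low 32
+ * bits and the byte offset into the MR in the high 32 bits.  For
+ * example, rds_rdma_make_cookie(0x1234, 8) yields 0x0000000800001234.
+ */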
+static inline rds_rdma_cookie_t rds_rdma_make_cookie(u32 r_key, u32 offset)
+{
+	return r_key | (((u64) offset) << 32);
+}
+
+static inline u32 rds_rdma_cookie_key(rds_rdma_cookie_t cookie)
+{
+	return cookie;
+}
+
+static inline u32 rds_rdma_cookie_offset(rds_rdma_cookie_t cookie)
+{
+	return cookie >> 32;
+}
+
+int rds_get_mr(struct rds_sock *rs, char __user *optval, int optlen);
+int rds_free_mr(struct rds_sock *rs, char __user *optval, int optlen);
+void rds_rdma_drop_keys(struct rds_sock *rs);
+int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg);
+int rds_cmsg_rdma_dest(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg);
+int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg);
+void rds_rdma_free_op(struct rds_rdma_op *ro);
+void rds_rdma_send_complete(struct rds_message *rm, int status);
+
+extern void __rds_put_mr_final(struct rds_mr *mr);
+static inline void rds_mr_put(struct rds_mr *mr)
+{
+	if (atomic_dec_and_test(&mr->r_refcount))
+		__rds_put_mr_final(mr);
+}
+
+#endif
diff --git a/drivers/infiniband/ulp/rds/rds_rdma.h b/drivers/infiniband/ulp/rds/rds_rdma.h
new file mode 100644
index 0000000..565482d
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/rds_rdma.h
@@ -0,0 +1,245 @@
+/*
+ * Copyright (c) 2008 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef IB_RDS_H
+#define IB_RDS_H
+
+#include <linux/types.h>
+
+/* These sparse-annotated types shouldn't be in any
+ * user-visible header file. We should clean this up
+ * rather than kludging around them. */
+#ifndef __KERNEL__
+#define __be16	u_int16_t
+#define __be32	u_int32_t
+#define __be64	u_int64_t
+#endif
+
+#define RDS_IB_ABI_VERSION		0x301
+
+/*
+ * setsockopt/getsockopt for SOL_RDS
+ */
+#define RDS_CANCEL_SENT_TO      	1
+#define RDS_GET_MR			2
+#define RDS_FREE_MR			3
+/* deprecated: RDS_BARRIER 4 */
+#define RDS_RECVERR			5
+#define RDS_CONG_MONITOR		6
+
+/*
+ * Control message types for SOL_RDS.
+ *
+ * RDS_CMSG_RDMA_ARGS (sendmsg)
+ *	Request an RDMA transfer to/from the specified
+ *	memory ranges.
+ *	The cmsg_data is a struct rds_rdma_args.
+ * RDS_CMSG_RDMA_DEST (recvmsg, sendmsg)
+ *	Kernel informs application about the intended
+ *	source/destination of an RDMA transfer.
+ * RDS_CMSG_RDMA_MAP (sendmsg)
+ *	Application asks kernel to map the given
+ *	memory range into an IB MR, and send the
+ *	R_Key along in an RDS extension header.
+ *	The cmsg_data is a struct rds_get_mr_args,
+ *	the same as for the GET_MR setsockopt.
+ * RDS_CMSG_RDMA_STATUS (recvmsg)
+ *	Returns the status of a completed RDMA operation.
+ */
+#define RDS_CMSG_RDMA_ARGS		1
+#define RDS_CMSG_RDMA_DEST		2
+#define RDS_CMSG_RDMA_MAP		3
+#define RDS_CMSG_RDMA_STATUS		4
+#define RDS_CMSG_CONG_UPDATE		5
+
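+/*
+ * Illustrative sketch of attaching RDMA arguments to a send from
+ * userspace (SOL_RDS is defined elsewhere in the series, 'dest' is a
+ * struct sockaddr_in, error handling omitted):
+ *
+ *	char ctl[CMSG_SPACE(sizeof(struct rds_rdma_args))];
+ *	struct msghdr msg;
+ *	struct cmsghdr *cmsg;
+ *
+ *	memset(&msg, 0, sizeof(msg));
+ *	msg.msg_name = &dest;
+ *	msg.msg_namelen = sizeof(dest);
+ *	msg.msg_control = ctl;
+ *	msg.msg_controllen = sizeof(ctl);
+ *
+ *	cmsg = CMSG_FIRSTHDR(&msg);
+ *	cmsg->cmsg_level = SOL_RDS;
+ *	cmsg->cmsg_type = RDS_CMSG_RDMA_ARGS;
+ *	cmsg->cmsg_len = CMSG_LEN(sizeof(struct rds_rdma_args));
+ *	memcpy(CMSG_DATA(cmsg), &args, sizeof(args));
+ *
+ *	sendmsg(fd, &msg, 0);
+ */
+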
+#define RDS_INFO_COUNTERS		10000
+#define RDS_INFO_CONNECTIONS		10001
+/* 10002 aka RDS_INFO_FLOWS is deprecated */
+#define RDS_INFO_SEND_MESSAGES		10003
+#define RDS_INFO_RETRANS_MESSAGES       10004
+#define RDS_INFO_RECV_MESSAGES          10005
+#define RDS_INFO_SOCKETS                10006
+#define RDS_INFO_TCP_SOCKETS            10007
+#define RDS_INFO_IB_CONNECTIONS		10008
+#define RDS_INFO_CONNECTION_STATS	10009
+
+struct rds_info_counter {
+	u_int8_t	name[32];
+	u_int64_t	value;
+} __attribute__((packed));
+
+#define RDS_INFO_CONNECTION_FLAG_SENDING	0x01
+#define RDS_INFO_CONNECTION_FLAG_CONNECTING	0x02
+#define RDS_INFO_CONNECTION_FLAG_CONNECTED	0x04
+
+struct rds_info_connection {
+	u_int64_t	next_tx_seq;
+	u_int64_t	next_rx_seq;
+	__be32		laddr;
+	__be32		faddr;
+	u_int8_t	transport[15];		/* null term ascii */
+	u_int8_t	flags;
+} __attribute__((packed));
+
+struct rds_info_flow {
+	__be32		laddr;
+	__be32		faddr;
+	u_int32_t	bytes;
+	__be16		lport;
+	__be16		fport;
+} __attribute__((packed));
+
+#define RDS_INFO_MESSAGE_FLAG_ACK               0x01
+#define RDS_INFO_MESSAGE_FLAG_FAST_ACK          0x02
+
+struct rds_info_message {
+	u_int64_t	seq;
+	u_int32_t	len;
+	__be32		laddr;
+	__be32		faddr;
+	__be16		lport;
+	__be16		fport;
+	u_int8_t	flags;
+} __attribute__((packed));
+
+struct rds_info_socket {
+	u_int32_t	sndbuf;
+	__be32		bound_addr;
+	__be32		connected_addr;
+	__be16		bound_port;
+	__be16		connected_port;
+	u_int32_t	rcvbuf;
+	u_int64_t	inum;
+} __attribute__((packed));
+
+#define RDS_IB_GID_LEN	16
+struct rds_info_rdma_connection {
+	__be32		src_addr;
+	__be32		dst_addr;
+	uint8_t		src_gid[RDS_IB_GID_LEN];
+	uint8_t		dst_gid[RDS_IB_GID_LEN];
+
+	uint32_t	max_send_wr;
+	uint32_t	max_recv_wr;
+	uint32_t	max_send_sge;
+	uint32_t	rdma_mr_max;
+	uint32_t	rdma_mr_size;
+};
+
+/*
+ * Congestion monitoring.
+ * Congestion control in RDS happens at the host connection
+ * level by exchanging a bitmap marking congested ports.
+ * By default, a process sleeping in poll() is always woken
+ * up when the congestion map is updated.
+ * With explicit monitoring, an application can have more
+ * fine-grained control.
+ * The application installs a 64bit mask value in the socket,
+ * where each bit corresponds to a group of ports.
+ * When a congestion update arrives, RDS checks the set of
+ * ports that are now uncongested against the bit mask
+ * installed in the socket, and if they overlap, we queue a
+ * cong_notification on the socket.
+ *
+ * To install the congestion monitor bitmask, use RDS_CONG_MONITOR
+ * with the 64bit mask.
+ * Congestion updates are received via RDS_CMSG_CONG_UPDATE
+ * control messages.
+ *
+ * The correspondence between bits and ports is
+ *	1 << (portnum % 64)
+ */
+#define RDS_CONG_MONITOR_SIZE	64
+#define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) port) % RDS_CONG_MONITOR_SIZE)
+#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
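+
+/*
+ * Illustrative use (userspace sketch; SOL_RDS is defined elsewhere in
+ * this series, 'port' is in host byte order):
+ *
+ *	uint64_t mask = RDS_CONG_MONITOR_MASK(port);
+ *
+ *	setsockopt(fd, SOL_RDS, RDS_CONG_MONITOR, &mask, sizeof(mask));
+ *
+ * and then watch recvmsg() for RDS_CMSG_CONG_UPDATE control messages
+ * whose 64bit payload shares bits with the installed mask.
+ */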
+
+/*
+ * RDMA related types
+ */
+
+/*
+ * This encapsulates a remote memory location.
+ * In the current implementation, it contains the R_Key
+ * of the remote memory region, and the offset into it
+ * (so that the application does not have to worry about
+ * alignment).
+ */
+typedef u_int64_t	rds_rdma_cookie_t;
+
+struct rds_iovec {
+	u_int64_t	addr;
+	u_int64_t	bytes;
+};
+
+struct rds_get_mr_args {
+	struct rds_iovec vec;
+	u_int64_t	cookie_addr;
+	u_int64_t	flags;
+};
+
+struct rds_free_mr_args {
+	rds_rdma_cookie_t cookie;
+	u_int64_t	flags;
+};
+
+struct rds_rdma_args {
+	rds_rdma_cookie_t cookie;
+	struct rds_iovec remote_vec;
+	u_int64_t	local_vec_addr;
+	u_int64_t	nr_local;
+	u_int64_t	flags;
+	u_int64_t	user_token;
+};
+
+struct rds_rdma_notify {
+	u_int64_t	user_token;
+	int32_t		status;
+};
+
+#define RDS_RDMA_SUCCESS	0
+#define RDS_RDMA_REMOTE_ERROR	1
+#define RDS_RDMA_CANCELED	2
+#define RDS_RDMA_DROPPED	3
+#define RDS_RDMA_OTHER_ERROR	4
+
+/*
+ * Common set of flags for all RDMA related structs
+ */
+#define RDS_RDMA_READWRITE	0x0001
+#define RDS_RDMA_FENCE		0x0002	/* use FENCE for immediate send */
+#define RDS_RDMA_INVALIDATE	0x0004	/* invalidate R_Key after freeing MR */
+#define RDS_RDMA_USE_ONCE	0x0008	/* free MR after use */
+#define RDS_RDMA_DONTWAIT	0x0010	/* Don't wait in SET_BARRIER */
+#define RDS_RDMA_NOTIFY_ME	0x0020	/* Notify when operation completes */
+
+#endif /* IB_RDS_H */
-- 
1.5.6.3



* [ofa-general] [PATCH 13/21] RDS/IB: Infiniband transport
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (11 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 12/21] RDS: RDMA support Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [PATCH 14/21] RDS/IB: Ring-handling code Andy Grover
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: netdev, rds-devel, general

Registers as an RDS transport and an IB client, and uses IB CM
API to allocate ids, queue pairs, and the rest of that fun stuff.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib.c    |  312 +++++++++++++
 drivers/infiniband/ulp/rds/ib.h    |  358 +++++++++++++++
 drivers/infiniband/ulp/rds/ib_cm.c |  882 ++++++++++++++++++++++++++++++++++++
 3 files changed, 1552 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib.c
 create mode 100644 drivers/infiniband/ulp/rds/ib.h
 create mode 100644 drivers/infiniband/ulp/rds/ib_cm.c

diff --git a/drivers/infiniband/ulp/rds/ib.c b/drivers/infiniband/ulp/rds/ib.c
new file mode 100644
index 0000000..cd35fba
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib.c
@@ -0,0 +1,312 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/in.h>
+#include <linux/if.h>
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <linux/if_arp.h>
+#include <linux/delay.h>
+
+#include "rds.h"
+#include "ib.h"
+
+unsigned int fmr_pool_size = RDS_FMR_POOL_SIZE;
+unsigned int fmr_message_size = RDS_FMR_SIZE + 1; /* +1 allows for unaligned MRs */
+
+module_param(fmr_pool_size, int, 0444);
+MODULE_PARM_DESC(fmr_pool_size, "Max number of FMRs per HCA");
+module_param(fmr_message_size, int, 0444);
+MODULE_PARM_DESC(fmr_message_size, "Max size of an RDMA transfer");
+
+struct list_head rds_ib_devices;
+
+DEFINE_SPINLOCK(ib_nodev_conns_lock);
+LIST_HEAD(ib_nodev_conns);
+
+void rds_ib_add_one(struct ib_device *device)
+{
+	struct rds_ib_device *rds_ibdev;
+	struct ib_device_attr *dev_attr;
+
+	/* Only handle IB (no iWARP) devices */
+	if (device->node_type != RDMA_NODE_IB_CA)
+		return;
+
+	dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL);
+	if (!dev_attr)
+		return;
+
+	if (ib_query_device(device, dev_attr)) {
+		rdsdebug("Query device failed for %s\n", device->name);
+		goto free_attr;
+	}
+
+	rds_ibdev = kmalloc(sizeof *rds_ibdev, GFP_KERNEL);
+	if (!rds_ibdev)
+		goto free_attr;
+
+	spin_lock_init(&rds_ibdev->spinlock);
+
+	rds_ibdev->max_wrs = dev_attr->max_qp_wr;
+	rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE);
+
+	rds_ibdev->fmr_page_shift = max(9, ffs(dev_attr->page_size_cap) - 1);
+	rds_ibdev->fmr_page_size  = 1 << rds_ibdev->fmr_page_shift;
+	rds_ibdev->fmr_page_mask  = ~((u64) rds_ibdev->fmr_page_size - 1);
+	rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr ?: 32;
+	rds_ibdev->max_fmrs = dev_attr->max_fmr ?
+			min_t(unsigned int, dev_attr->max_fmr, fmr_pool_size) :
+			fmr_pool_size;
+
+	rds_ibdev->dev = device;
+	rds_ibdev->pd = ib_alloc_pd(device);
+	if (IS_ERR(rds_ibdev->pd))
+		goto free_dev;
+
+	rds_ibdev->mr = ib_get_dma_mr(rds_ibdev->pd,
+				      IB_ACCESS_LOCAL_WRITE);
+	if (IS_ERR(rds_ibdev->mr))
+		goto err_pd;
+
+	rds_ibdev->mr_pool = rds_ib_create_mr_pool(rds_ibdev);
+	if (IS_ERR(rds_ibdev->mr_pool)) {
+		rds_ibdev->mr_pool = NULL;
+		goto err_mr;
+	}
+
+	INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
+	INIT_LIST_HEAD(&rds_ibdev->conn_list);
+	list_add_tail(&rds_ibdev->list, &rds_ib_devices);
+
+	ib_set_client_data(device, &rds_ib_client, rds_ibdev);
+
+	goto free_attr;
+
+err_mr:
+	ib_dereg_mr(rds_ibdev->mr);
+err_pd:
+	ib_dealloc_pd(rds_ibdev->pd);
+free_dev:
+	kfree(rds_ibdev);
+free_attr:
+	kfree(dev_attr);
+}
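
To make the FMR page geometry computation above concrete: with a typical
page_size_cap of 0x1000 (lowest supported page is 4 KiB), ffs() returns 13,
so fmr_page_shift becomes max(9, 12) = 12. The same arithmetic in a
standalone userspace sketch:

/* Userspace illustration of the fmr_page_shift math in rds_ib_add_one(). */
#include <stdio.h>
#include <strings.h>	/* ffs() */

int main(void)
{
	unsigned int page_size_cap = 0x1000;	/* HCA: 4 KiB pages and up */
	int shift = ffs(page_size_cap) - 1;	/* lowest set bit: 12 */

	if (shift < 9)
		shift = 9;	/* never below 512-byte pages */

	printf("shift %d size %u mask %#llx\n", shift, 1u << shift,
	       ~((unsigned long long)(1u << shift) - 1));
	return 0;	/* prints: shift 12 size 4096 mask 0xfffffffffffff000 */
}
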
+
+void rds_ib_remove_one(struct ib_device *device)
+{
+	struct rds_ib_device *rds_ibdev;
+	struct rds_ib_ipaddr *i_ipaddr, *i_next;
+
+	rds_ibdev = ib_get_client_data(device, &rds_ib_client);
+	if (!rds_ibdev)
+		return;
+
+	list_for_each_entry_safe(i_ipaddr, i_next, &rds_ibdev->ipaddr_list, list) {
+		list_del(&i_ipaddr->list);
+		kfree(i_ipaddr);
+	}
+
+	rds_ib_remove_conns(rds_ibdev);
+
+	if (rds_ibdev->mr_pool)
+		rds_ib_destroy_mr_pool(rds_ibdev->mr_pool);
+
+	ib_dereg_mr(rds_ibdev->mr);
+
+	while (ib_dealloc_pd(rds_ibdev->pd)) {
+		rdsdebug("%s-%d Failed to dealloc pd %p\n", __func__, __LINE__, rds_ibdev->pd);
+		msleep(1);
+	}
+
+	list_del(&rds_ibdev->list);
+	kfree(rds_ibdev);
+}
+
+struct ib_client rds_ib_client = {
+	.name   = "rds_ib",
+	.add    = rds_ib_add_one,
+	.remove = rds_ib_remove_one
+};
+
+static int rds_ib_conn_info_visitor(struct rds_connection *conn,
+				    void *buffer)
+{
+	struct rds_info_rdma_connection *iinfo = buffer;
+	struct rds_ib_connection *ic;
+
+	/* We will only ever look at IB transports */
+	if (conn->c_trans != &rds_ib_transport)
+		return 0;
+
+	iinfo->src_addr = conn->c_laddr;
+	iinfo->dst_addr = conn->c_faddr;
+
+	memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid));
+	memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid));
+	if (rds_conn_state(conn) == RDS_CONN_UP) {
+		struct rds_ib_device *rds_ibdev;
+		struct rdma_dev_addr *dev_addr;
+
+		ic = conn->c_transport_data;
+		dev_addr = &ic->i_cm_id->route.addr.dev_addr;
+
+		ib_addr_get_sgid(dev_addr, (union ib_gid *) &iinfo->src_gid);
+		ib_addr_get_dgid(dev_addr, (union ib_gid *) &iinfo->dst_gid);
+
+		rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client);
+		iinfo->max_send_wr = ic->i_send_ring.w_nr;
+		iinfo->max_recv_wr = ic->i_recv_ring.w_nr;
+		iinfo->max_send_sge = rds_ibdev->max_sge;
+		rds_ib_get_mr_info(rds_ibdev, iinfo);
+	}
+	return 1;
+}
+
+static void rds_ib_ic_info(struct socket *sock, unsigned int len,
+			   struct rds_info_iterator *iter,
+			   struct rds_info_lengths *lens)
+{
+	rds_for_each_conn_info(sock, len, iter, lens,
+				rds_ib_conn_info_visitor,
+				sizeof(struct rds_info_rdma_connection));
+}
+
+
+/*
+ * Early RDS/IB was built to only bind to an address if there is an IPoIB
+ * device with that address set.
+ *
+ * If it were me, I'd advocate for something more flexible.  Sending and
+ * receiving should be device-agnostic.  Transports would try and maintain
+ * connections between peers who have messages queued.  Userspace would be
+ * allowed to influence which paths have priority.  We could call userspace
+ * asserting this policy "routing".
+ */
+static int rds_ib_laddr_check(__be32 addr)
+{
+	struct net_device *dev;
+	int ret;
+
+	dev = ip_dev_find(&init_net, addr);
+	if (dev && dev->type == ARPHRD_INFINIBAND) {
+		dev_put(dev);
+		ret = 0;
+	} else
+		ret = -EADDRNOTAVAIL;
+
+	rdsdebug("addr %u.%u.%u.%u ret %d\n", NIPQUAD(addr), ret);
+
+	return ret;
+}
+
+void rds_ib_exit(void)
+{
+	rds_info_deregister_func(RDS_INFO_IB_CONNECTIONS, rds_ib_ic_info);
+	rds_ib_listen_stop();
+	rds_ib_remove_nodev_conns();
+	ib_unregister_client(&rds_ib_client);
+	rds_ib_sysctl_exit();
+	rds_ib_recv_exit();
+	rds_trans_unregister(&rds_ib_transport);
+}
+
+struct rds_transport rds_ib_transport = {
+	.laddr_check		= rds_ib_laddr_check,
+	.xmit_complete		= rds_ib_xmit_complete,
+	.xmit			= rds_ib_xmit,
+	.xmit_cong_map		= NULL,
+	.xmit_rdma		= rds_ib_xmit_rdma,
+	.recv			= rds_ib_recv,
+	.conn_alloc		= rds_ib_conn_alloc,
+	.conn_free		= rds_ib_conn_free,
+	.conn_connect		= rds_ib_conn_connect,
+	.conn_shutdown		= rds_ib_conn_shutdown,
+	.inc_copy_to_user	= rds_ib_inc_copy_to_user,
+	.inc_purge		= rds_ib_inc_purge,
+	.inc_free		= rds_ib_inc_free,
+	.listen_stop		= rds_ib_listen_stop,
+	.stats_info_copy	= rds_ib_stats_info_copy,
+	.exit			= rds_ib_exit,
+	.get_mr			= rds_ib_get_mr,
+	.sync_mr		= rds_ib_sync_mr,
+	.free_mr		= rds_ib_free_mr,
+	.flush_mrs		= rds_ib_flush_mrs,
+	.t_owner		= THIS_MODULE,
+	.t_name			= "infiniband",
+};
+
+int __init rds_ib_init(void)
+{
+	int ret;
+
+	INIT_LIST_HEAD(&rds_ib_devices);
+
+	ret = ib_register_client(&rds_ib_client);
+	if (ret)
+		goto out;
+
+	ret = rds_ib_sysctl_init();
+	if (ret)
+		goto out_ibreg;
+
+	ret = rds_ib_recv_init();
+	if (ret)
+		goto out_sysctl;
+
+	ret = rds_trans_register(&rds_ib_transport);
+	if (ret)
+		goto out_recv;
+
+	ret = rds_ib_listen_init();
+	if (ret)
+		goto out_register;
+
+	rds_info_register_func(RDS_INFO_IB_CONNECTIONS, rds_ib_ic_info);
+
+	goto out;
+
+out_register:
+	rds_trans_unregister(&rds_ib_transport);
+out_recv:
+	rds_ib_recv_exit();
+out_sysctl:
+	rds_ib_sysctl_exit();
+out_ibreg:
+	ib_unregister_client(&rds_ib_client);
+out:
+	return ret;
+}
+
+MODULE_LICENSE("GPL");
+
diff --git a/drivers/infiniband/ulp/rds/ib.h b/drivers/infiniband/ulp/rds/ib.h
new file mode 100644
index 0000000..eff70f0
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib.h
@@ -0,0 +1,358 @@
+#ifndef _RDS_IB_H
+#define _RDS_IB_H
+
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include "rds.h"
+
+#define RDS_IB_RESOLVE_TIMEOUT_MS	5000
+
+#define RDS_FMR_SIZE			256
+#define RDS_FMR_POOL_SIZE		4096
+
+#define RDS_IB_MAX_SGE			8
+#define RDS_IB_RECV_SGE 		2
+
+#define RDS_IB_DEFAULT_RECV_WR		1024
+#define RDS_IB_DEFAULT_SEND_WR		256
+
+#define RDS_IB_SUPPORTED_PROTOCOLS	0x00000003	/* minor versions supported */
+
+extern struct list_head rds_ib_devices;
+
+/*
+ * IB posts page fragments of RDS_FRAG_SIZE bytes to the receive queues
+ * to try and minimize the amount of memory tied up in both the device
+ * and socket receive queues.
+ */
+/* page offset of the final full frag that fits in the page */
+#define RDS_PAGE_LAST_OFF (((PAGE_SIZE  / RDS_FRAG_SIZE) - 1) * RDS_FRAG_SIZE)
+struct rds_page_frag {
+	struct list_head	f_item;
+	struct page		*f_page;
+	unsigned long		f_offset;
+	dma_addr_t 		f_mapped;
+};
+
+struct rds_ib_incoming {
+	struct list_head	ii_frags;
+	struct rds_incoming	ii_inc;
+};
+
+struct rds_ib_connect_private {
+	/* Add new fields at the end, and don't permute existing fields. */
+	__be32			dp_saddr;
+	__be32			dp_daddr;
+	u8			dp_protocol_major;
+	u8			dp_protocol_minor;
+	__be16			dp_protocol_minor_mask; /* bitmask */
+	__be32			dp_reserved1;
+	__be64			dp_ack_seq;
+	__be32			dp_credit;		/* non-zero enables flow ctl */
+};
+
+struct rds_ib_send_work {
+	struct rds_message	*s_rm;
+	struct rds_rdma_op	*s_op;
+	struct ib_send_wr	s_wr;
+	struct ib_sge		s_sge[RDS_IB_MAX_SGE];
+	unsigned long		s_queued;
+};
+
+struct rds_ib_recv_work {
+	struct rds_ib_incoming 	*r_ibinc;
+	struct rds_page_frag	*r_frag;
+	struct ib_recv_wr	r_wr;
+	struct ib_sge		r_sge[2];
+};
+
+struct rds_ib_work_ring {
+	u32		w_nr;
+	u32		w_alloc_ptr;
+	u32		w_alloc_ctr;
+	u32		w_free_ptr;
+	atomic_t	w_free_ctr;
+};
+
+struct rds_ib_device;
+
+struct rds_ib_connection {
+
+	struct list_head	ib_node;
+	struct rds_ib_device	*rds_ibdev;
+	struct rds_connection	*conn;
+
+	/* alphabet soup, IBTA style */
+	struct rdma_cm_id	*i_cm_id;
+	struct ib_pd		*i_pd;
+	struct ib_mr		*i_mr;
+	struct ib_cq		*i_send_cq;
+	struct ib_cq		*i_recv_cq;
+
+	/* tx */
+	struct rds_ib_work_ring	i_send_ring;
+	struct rds_message	*i_rm;
+	struct rds_header	*i_send_hdrs;
+	u64			i_send_hdrs_dma;
+	struct rds_ib_send_work *i_sends;
+
+	/* rx */
+	struct mutex		i_recv_mutex;
+	struct rds_ib_work_ring	i_recv_ring;
+	struct rds_ib_incoming	*i_ibinc;
+	u32			i_recv_data_rem;
+	struct rds_header	*i_recv_hdrs;
+	u64			i_recv_hdrs_dma;
+	struct rds_ib_recv_work *i_recvs;
+	struct rds_page_frag	i_frag;
+	u64			i_ack_recv;	/* last ACK received */
+
+	/* sending acks */
+	unsigned long		i_ack_flags;
+#ifndef KERNEL_HAS_ATOMIC64
+	spinlock_t		i_ack_lock;	/* protect i_ack_next */
+	u64			i_ack_next;	/* next ACK to send */
+#else
+	atomic64_t		i_ack_next;	/* next ACK to send */
+#endif
+	struct rds_header	*i_ack;
+	struct ib_send_wr	i_ack_wr;
+	struct ib_sge		i_ack_sge;
+	u64			i_ack_dma;
+	unsigned long		i_ack_queued;
+
+	/* Flow control related information
+	 *
+	 * Our algorithm uses a pair of variables that we need to access
+	 * atomically - one for the send credits, and one for the posted
+	 * recv credits we need to transfer to the remote.
+	 * Rather than protect them using a slow spinlock, we put both into
+	 * a single atomic_t and update it using cmpxchg
+	 */
+	atomic_t		i_credits;
+
+	/* Protocol version specific information */
+	unsigned int		i_flowctl:1;	/* enable/disable flow ctl */
+
+	/* Batched completions */
+	unsigned int		i_unsignaled_wrs;
+	long			i_unsignaled_bytes;
+};
+
+/* This assumes that atomic_t is at least 32 bits */
+#define IB_GET_SEND_CREDITS(v)	((v) & 0xffff)
+#define IB_GET_POST_CREDITS(v)	((v) >> 16)
+#define IB_SET_SEND_CREDITS(v)	((v) & 0xffff)
+#define IB_SET_POST_CREDITS(v)	((v) << 16)
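
The send and post counters above are updated lock-free with
atomic_cmpxchg(); the real logic lives in rds_ib_send_grab_credits() in
ib_send.c later in this series and differs in detail. A simplified sketch
of the idea:

/* Simplified sketch of the lock-free credit grab described above. */
static unsigned int example_grab_send_credits(atomic_t *credits, u32 wanted)
{
	unsigned int oldval, newval;

	do {
		oldval = atomic_read(credits);
		if (IB_GET_SEND_CREDITS(oldval) < wanted)
			return 0;	/* caller must throttle and retry */
		/* Send credits live in the low 16 bits, so a plain
		 * subtract cannot borrow into the post-credit half. */
		newval = oldval - IB_SET_SEND_CREDITS(wanted);
	} while (atomic_cmpxchg(credits, oldval, newval) != (int)oldval);

	return wanted;
}
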
+
+struct rds_ib_ipaddr {
+	struct list_head	list;
+	__be32			ipaddr;
+};
+
+struct rds_ib_device {
+	struct list_head	list;
+	struct list_head	ipaddr_list;
+	struct list_head	conn_list;
+	struct ib_device	*dev;
+	struct ib_pd		*pd;
+	struct ib_mr		*mr;
+	struct rds_ib_mr_pool	*mr_pool;
+	int			fmr_page_shift;
+	int			fmr_page_size;
+	u64			fmr_page_mask;
+	unsigned int		fmr_max_remaps;
+	unsigned int		max_fmrs;
+	int			max_sge;
+	unsigned int		max_wrs;
+	spinlock_t		spinlock;	/* protect the above */
+};
+
+/* bits for i_ack_flags */
+#define IB_ACK_IN_FLIGHT	0
+#define IB_ACK_REQUESTED	1
+
+/* Magic WR_ID for ACKs */
+#define RDS_IB_ACK_WR_ID	(~(u64) 0)
+
+struct rds_ib_statistics {
+	uint64_t	s_ib_connect_raced;
+	uint64_t	s_ib_listen_closed_stale;
+	uint64_t	s_ib_tx_cq_call;
+	uint64_t	s_ib_tx_cq_event;
+	uint64_t	s_ib_tx_ring_full;
+	uint64_t	s_ib_tx_throttle;
+	uint64_t	s_ib_tx_sg_mapping_failure;
+	uint64_t	s_ib_tx_stalled;
+	uint64_t	s_ib_tx_credit_updates;
+	uint64_t	s_ib_rx_cq_call;
+	uint64_t	s_ib_rx_cq_event;
+	uint64_t	s_ib_rx_ring_empty;
+	uint64_t	s_ib_rx_refill_from_cq;
+	uint64_t	s_ib_rx_refill_from_thread;
+	uint64_t	s_ib_rx_alloc_limit;
+	uint64_t	s_ib_rx_credit_updates;
+	uint64_t	s_ib_ack_sent;
+	uint64_t	s_ib_ack_send_failure;
+	uint64_t	s_ib_ack_send_delayed;
+	uint64_t	s_ib_ack_send_piggybacked;
+	uint64_t	s_ib_ack_received;
+	uint64_t	s_ib_rdma_mr_alloc;
+	uint64_t	s_ib_rdma_mr_free;
+	uint64_t	s_ib_rdma_mr_used;
+	uint64_t	s_ib_rdma_mr_pool_flush;
+	uint64_t	s_ib_rdma_mr_pool_wait;
+	uint64_t	s_ib_rdma_mr_pool_depleted;
+};
+
+extern struct workqueue_struct *rds_ib_wq;
+
+/*
+ * Fake ib_dma_sync_sg_for_{cpu,device} as long as ib_verbs.h
+ * doesn't define it.
+ */
+static inline void rds_ib_dma_sync_sg_for_cpu(struct ib_device *dev,
+		struct scatterlist *sg, unsigned int sg_dma_len, int direction)
+{
+	unsigned int i;
+
+	for (i = 0; i < sg_dma_len; ++i) {
+		ib_dma_sync_single_for_cpu(dev,
+				ib_sg_dma_address(dev, &sg[i]),
+				ib_sg_dma_len(dev, &sg[i]),
+				direction);
+	}
+}
+#define ib_dma_sync_sg_for_cpu	rds_ib_dma_sync_sg_for_cpu
+
+static inline void rds_ib_dma_sync_sg_for_device(struct ib_device *dev,
+		struct scatterlist *sg, unsigned int sg_dma_len, int direction)
+{
+	unsigned int i;
+
+	for (i = 0; i < sg_dma_len; ++i) {
+		ib_dma_sync_single_for_device(dev,
+				ib_sg_dma_address(dev, &sg[i]),
+				ib_sg_dma_len(dev, &sg[i]),
+				direction);
+	}
+}
+#define ib_dma_sync_sg_for_device	rds_ib_dma_sync_sg_for_device
+
+
+/* ib.c */
+extern struct rds_transport rds_ib_transport;
+extern void rds_ib_add_one(struct ib_device *device);
+extern void rds_ib_remove_one(struct ib_device *device);
+extern struct ib_client rds_ib_client;
+
+extern unsigned int fmr_pool_size;
+extern unsigned int fmr_message_size;
+
+extern spinlock_t ib_nodev_conns_lock;
+extern struct list_head ib_nodev_conns;
+
+/* ib_cm.c */
+int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp);
+void rds_ib_conn_free(void *arg);
+int rds_ib_conn_connect(struct rds_connection *conn);
+void rds_ib_conn_shutdown(struct rds_connection *conn);
+void rds_ib_state_change(struct sock *sk);
+int __init rds_ib_listen_init(void);
+void rds_ib_listen_stop(void);
+void __rds_ib_conn_error(struct rds_connection *conn, const char *, ...);
+
+#define rds_ib_conn_error(conn, fmt...) \
+	__rds_ib_conn_error(conn, KERN_WARNING "RDS/IB: " fmt)
+
+/* ib_rdma.c */
+int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr);
+int rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn);
+void rds_ib_remove_nodev_conns(void);
+void rds_ib_remove_conns(struct rds_ib_device *rds_ibdev);
+struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *);
+void rds_ib_get_mr_info(struct rds_ib_device *rds_ibdev, struct rds_info_rdma_connection *iinfo);
+void rds_ib_destroy_mr_pool(struct rds_ib_mr_pool *);
+void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
+		    struct rds_sock *rs, u32 *key_ret);
+void rds_ib_sync_mr(void *trans_private, int dir);
+void rds_ib_free_mr(void *trans_private, int invalidate);
+void rds_ib_flush_mrs(void);
+
+/* ib_recv.c */
+int __init rds_ib_recv_init(void);
+void rds_ib_recv_exit(void);
+int rds_ib_recv(struct rds_connection *conn);
+int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp,
+		       gfp_t page_gfp, int prefill);
+void rds_ib_inc_purge(struct rds_incoming *inc);
+void rds_ib_inc_free(struct rds_incoming *inc);
+int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *iov,
+			     size_t size);
+void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_ib_recv_init_ring(struct rds_ib_connection *ic);
+void rds_ib_recv_clear_ring(struct rds_ib_connection *ic);
+void rds_ib_recv_init_ack(struct rds_ib_connection *ic);
+void rds_ib_attempt_ack(struct rds_ib_connection *ic);
+void rds_ib_ack_send_complete(struct rds_ib_connection *ic);
+u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic);
+
+/* ib_ring.c */
+void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr);
+void rds_ib_ring_resize(struct rds_ib_work_ring *ring, u32 nr);
+u32 rds_ib_ring_alloc(struct rds_ib_work_ring *ring, u32 val, u32 *pos);
+void rds_ib_ring_free(struct rds_ib_work_ring *ring, u32 val);
+void rds_ib_ring_unalloc(struct rds_ib_work_ring *ring, u32 val);
+int rds_ib_ring_empty(struct rds_ib_work_ring *ring);
+int rds_ib_ring_low(struct rds_ib_work_ring *ring);
+u32 rds_ib_ring_oldest(struct rds_ib_work_ring *ring);
+u32 rds_ib_ring_completed(struct rds_ib_work_ring *ring, u32 wr_id, u32 oldest);
+extern wait_queue_head_t rds_ib_ring_empty_wait;
+
+/* ib_send.c */
+void rds_ib_xmit_complete(struct rds_connection *conn);
+int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
+		unsigned int hdr_off, unsigned int sg, unsigned int off);
+void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_ib_send_init_ring(struct rds_ib_connection *ic);
+void rds_ib_send_clear_ring(struct rds_ib_connection *ic);
+int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op);
+void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits);
+void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted);
+int rds_ib_send_grab_credits(struct rds_ib_connection *ic, u32 wanted,
+			     u32 *adv_credits);
+
+/* ib_stats.c */
+DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats);
+#define rds_ib_stats_inc(member) rds_stats_inc_which(rds_ib_stats, member)
+unsigned int rds_ib_stats_info_copy(struct rds_info_iterator *iter,
+				    unsigned int avail);
+
+/* ib_sysctl.c */
+int __init rds_ib_sysctl_init(void);
+void rds_ib_sysctl_exit(void);
+extern unsigned long rds_ib_sysctl_max_send_wr;
+extern unsigned long rds_ib_sysctl_max_recv_wr;
+extern unsigned long rds_ib_sysctl_max_unsig_wrs;
+extern unsigned long rds_ib_sysctl_max_unsig_bytes;
+extern unsigned long rds_ib_sysctl_max_recv_allocation;
+extern unsigned int rds_ib_sysctl_flow_control;
+extern ctl_table rds_ib_sysctl_table[];
+
+/*
+ * Helper functions for getting/setting the header and data SGEs in
+ * RDS packets (not RDMA)
+ */
+static inline struct ib_sge *
+rds_ib_header_sge(struct rds_ib_connection *ic, struct ib_sge *sge)
+{
+	return &sge[0];
+}
+
+static inline struct ib_sge *
+rds_ib_data_sge(struct rds_ib_connection *ic, struct ib_sge *sge)
+{
+	return &sge[1];
+}
+
+#endif
diff --git a/drivers/infiniband/ulp/rds/ib_cm.c b/drivers/infiniband/ulp/rds/ib_cm.c
new file mode 100644
index 0000000..870070d
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_cm.c
@@ -0,0 +1,882 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/in.h>
+#include <linux/vmalloc.h>
+
+#include "rds.h"
+#include "ib.h"
+
+static struct rdma_cm_id *rds_ib_listen_id;
+
+/*
+ * Set the selected protocol version
+ */
+static void rds_ib_set_protocol(struct rds_connection *conn, unsigned int version)
+{
+	conn->c_version = version;
+}
+
+/*
+ * Set up flow control
+ */
+static void rds_ib_set_flow_control(struct rds_connection *conn, u32 credits)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	if (rds_ib_sysctl_flow_control && credits != 0) {
+		/* We're doing flow control */
+		ic->i_flowctl = 1;
+		rds_ib_send_add_credits(conn, credits);
+	} else {
+		ic->i_flowctl = 0;
+	}
+}
+
+/*
+ * Tune RNR behavior. Without flow control, we use a rather
+ * low timeout, but not the absolute minimum - this should
+ * be tunable.
+ *
+ * We already set the RNR retry count to 7 (which is the
+ * smallest infinite number :-) above.
+ * If flow control is on, we want to change this back to 0
+ * so that we learn quickly when our credit accounting is
+ * buggy.
+ *
+ * Caller passes in a qp_attr pointer - don't waste stack space
+ * by allocating this twice.
+ */
+static void
+rds_ib_tune_rnr(struct rds_ib_connection *ic, struct ib_qp_attr *attr)
+{
+	int ret;
+
+	if (ic->i_flowctl) {
+		/* It seems we have to take a brief detour through SQD state
+		 * in order to change the RNR retry count. */
+		attr->qp_state = IB_QPS_SQD;
+		ret = ib_modify_qp(ic->i_cm_id->qp, attr, IB_QP_STATE);
+		if (ret)
+			printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, SQD): err=%d\n", -ret);
+
+		attr->rnr_retry = 0;
+		ret = ib_modify_qp(ic->i_cm_id->qp, attr, IB_QP_RNR_RETRY);
+		if (ret)
+			printk(KERN_NOTICE "ib_modify_qp(IB_QP_RNR_RETRY, 0): err=%d\n", -ret);
+	} else {
+		attr->min_rnr_timer = IB_RNR_TIMER_000_32;
+		ret = ib_modify_qp(ic->i_cm_id->qp, attr, IB_QP_MIN_RNR_TIMER);
+		if (ret)
+			printk(KERN_NOTICE "ib_modify_qp(IB_QP_MIN_RNR_TIMER): err=%d\n", -ret);
+	}
+}
+
+/*
+ * Connection established.
+ * We get here for both outgoing and incoming connection.
+ */
+static void rds_ib_connect_complete(struct rds_connection *conn, struct rdma_cm_event *event)
+{
+	const struct rds_ib_connect_private *dp = NULL;
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct rds_ib_device *rds_ibdev;
+	struct ib_qp_attr qp_attr;
+	int err;
+
+	if (event->param.conn.private_data_len) {
+		dp = event->param.conn.private_data;
+
+		rds_ib_set_protocol(conn,
+				RDS_PROTOCOL(dp->dp_protocol_major,
+					dp->dp_protocol_minor));
+		rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit));
+	}
+
+	printk(KERN_NOTICE "RDS/IB: connected to %u.%u.%u.%u version %u.%u%s\n",
+			NIPQUAD(conn->c_laddr),
+			RDS_PROTOCOL_MAJOR(conn->c_version),
+			RDS_PROTOCOL_MINOR(conn->c_version),
+			ic->i_flowctl ? ", flow control" : "");
+
+	/* Tune RNR behavior */
+	rds_ib_tune_rnr(ic, &qp_attr);
+
+	qp_attr.qp_state = IB_QPS_RTS;
+	err = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE);
+	if (err)
+		printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, RTS): err=%d\n", err);
+
+	/* update ib_device with this local ipaddr & conn */
+	rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client);
+	err = rds_ib_update_ipaddr(rds_ibdev, conn->c_laddr);
+	if (err)
+		printk(KERN_ERR "rds_ib_update_ipaddr failed (%d)\n", err);
+	err = rds_ib_add_conn(rds_ibdev, conn);
+	if (err)
+		printk(KERN_ERR "rds_ib_add_conn failed (%d)\n", err);
+
+	/* If the peer gave us the last packet it saw, process this as if
+	 * we had received a regular ACK. */
+	if (dp && dp->dp_ack_seq)
+		rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL);
+
+	rds_connect_complete(conn);
+}
+
+static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
+			struct rdma_conn_param *conn_param,
+			struct rds_ib_connect_private *dp,
+			u32 protocol_version)
+{
+	memset(conn_param, 0, sizeof(struct rdma_conn_param));
+	/* XXX tune these? */
+	conn_param->responder_resources = 1;
+	conn_param->initiator_depth = 1;
+	conn_param->retry_count = 7;
+	conn_param->rnr_retry_count = 7;
+
+	if (dp) {
+		struct rds_ib_connection *ic = conn->c_transport_data;
+
+		memset(dp, 0, sizeof(*dp));
+		dp->dp_saddr = conn->c_laddr;
+		dp->dp_daddr = conn->c_faddr;
+		dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version);
+		dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version);
+		dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
+		dp->dp_ack_seq = rds_ib_piggyb_ack(ic);
+
+		/* Advertise flow control */
+		if (ic->i_flowctl) {
+			unsigned int credits;
+
+			credits = IB_GET_POST_CREDITS(atomic_read(&ic->i_credits));
+			dp->dp_credit = cpu_to_be32(credits);
+			atomic_sub(IB_SET_POST_CREDITS(credits), &ic->i_credits);
+		}
+
+		conn_param->private_data = dp;
+		conn_param->private_data_len = sizeof(*dp);
+	}
+}
+
+static void rds_ib_cq_event_handler(struct ib_event *event, void *data)
+{
+	rdsdebug("event %u data %p\n", event->event, data);
+}
+
+static void rds_ib_qp_event_handler(struct ib_event *event, void *data)
+{
+	struct rds_connection *conn = data;
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	rdsdebug("conn %p ic %p event %u\n", conn, ic, event->event);
+
+	switch (event->event) {
+	case IB_EVENT_COMM_EST:
+		rdma_notify(ic->i_cm_id, IB_EVENT_COMM_EST);
+		break;
+	default:
+		printk(KERN_WARNING "RDS/ib: unhandled QP event %u "
+		       "on connection to %u.%u.%u.%u\n", event->event,
+		       NIPQUAD(conn->c_faddr));
+		break;
+	}
+}
+
+/*
+ * This needs to be very careful to not leave IS_ERR pointers around for
+ * cleanup to trip over.
+ */
+static int rds_ib_setup_qp(struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct ib_device *dev = ic->i_cm_id->device;
+	struct ib_qp_init_attr attr;
+	struct rds_ib_device *rds_ibdev;
+	int ret;
+
+	/* rds_ib_add_one creates a rds_ib_device object per IB device,
+	 * and allocates a protection domain, memory region and FMR pool
+	 * for each.  If that fails for any reason, it will not register
+	 * the rds_ibdev at all.
+	 */
+	rds_ibdev = ib_get_client_data(dev, &rds_ib_client);
+	if (rds_ibdev == NULL) {
+		if (printk_ratelimit())
+			printk(KERN_NOTICE "RDS/IB: No client_data for device %s\n",
+					dev->name);
+		return -EOPNOTSUPP;
+	}
+
+	if (rds_ibdev->max_wrs < ic->i_send_ring.w_nr + 1)
+		rds_ib_ring_resize(&ic->i_send_ring, rds_ibdev->max_wrs - 1);
+	if (rds_ibdev->max_wrs < ic->i_recv_ring.w_nr + 1)
+		rds_ib_ring_resize(&ic->i_recv_ring, rds_ibdev->max_wrs - 1);
+
+	/* Protection domain and memory region */
+	ic->i_pd = rds_ibdev->pd;
+	ic->i_mr = rds_ibdev->mr;
+
+	ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler,
+				     rds_ib_cq_event_handler, conn,
+				     ic->i_send_ring.w_nr + 1, 0);
+	if (IS_ERR(ic->i_send_cq)) {
+		ret = PTR_ERR(ic->i_send_cq);
+		ic->i_send_cq = NULL;
+		rdsdebug("ib_create_cq send failed: %d\n", ret);
+		goto out;
+	}
+
+	ic->i_recv_cq = ib_create_cq(dev, rds_ib_recv_cq_comp_handler,
+				     rds_ib_cq_event_handler, conn,
+				     ic->i_recv_ring.w_nr, 0);
+	if (IS_ERR(ic->i_recv_cq)) {
+		ret = PTR_ERR(ic->i_recv_cq);
+		ic->i_recv_cq = NULL;
+		rdsdebug("ib_create_cq recv failed: %d\n", ret);
+		goto out;
+	}
+
+	ret = ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
+	if (ret) {
+		rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
+		goto out;
+	}
+
+	ret = ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
+	if (ret) {
+		rdsdebug("ib_req_notify_cq recv failed: %d\n", ret);
+		goto out;
+	}
+
+	/* XXX negotiate max send/recv with remote? */
+	memset(&attr, 0, sizeof(attr));
+	attr.event_handler = rds_ib_qp_event_handler;
+	attr.qp_context = conn;
+	/* + 1 to allow for the single ack message */
+	attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1;
+	attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1;
+	attr.cap.max_send_sge = rds_ibdev->max_sge;
+	attr.cap.max_recv_sge = RDS_IB_RECV_SGE;
+	attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+	attr.qp_type = IB_QPT_RC;
+	attr.send_cq = ic->i_send_cq;
+	attr.recv_cq = ic->i_recv_cq;
+
+	/*
+	 * XXX this can fail if max_*_wr is too large?  Are we supposed
+	 * to back off until we get a value that the hardware can support?
+	 */
+	ret = rdma_create_qp(ic->i_cm_id, ic->i_pd, &attr);
+	if (ret) {
+		rdsdebug("rdma_create_qp failed: %d\n", ret);
+		goto out;
+	}
+
+	ic->i_send_hdrs = ib_dma_alloc_coherent(dev,
+					   ic->i_send_ring.w_nr *
+						sizeof(struct rds_header),
+					   &ic->i_send_hdrs_dma, GFP_KERNEL);
+	if (ic->i_send_hdrs == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("ib_dma_alloc_coherent send failed\n");
+		goto out;
+	}
+
+	ic->i_recv_hdrs = ib_dma_alloc_coherent(dev,
+					   ic->i_recv_ring.w_nr *
+						sizeof(struct rds_header),
+					   &ic->i_recv_hdrs_dma, GFP_KERNEL);
+	if (ic->i_recv_hdrs == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("ib_dma_alloc_coherent recv failed\n");
+		goto out;
+	}
+
+	ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header),
+				       &ic->i_ack_dma, GFP_KERNEL);
+	if (ic->i_ack == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("ib_dma_alloc_coherent ack failed\n");
+		goto out;
+	}
+
+	ic->i_sends = vmalloc(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work));
+	if (ic->i_sends == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("send allocation failed\n");
+		goto out;
+	}
+	rds_ib_send_init_ring(ic);
+
+	ic->i_recvs = vmalloc(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work));
+	if (ic->i_recvs == NULL) {
+		ret = -ENOMEM;
+		rdsdebug("recv allocation failed\n");
+		goto out;
+	}
+
+	rds_ib_recv_init_ring(ic);
+	rds_ib_recv_init_ack(ic);
+
+	/* Post receive buffers - as a side effect, this will update
+	 * the posted credit count. */
+	rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1);
+
+	rdsdebug("conn %p pd %p mr %p cq %p %p\n", conn, ic->i_pd, ic->i_mr,
+		 ic->i_send_cq, ic->i_recv_cq);
+
+out:
+	return ret;
+}
+
+static u32 rds_ib_protocol_compatible(const struct rds_ib_connect_private *dp)
+{
+	u16 common;
+	u32 version = 0;
+
+	/* rdma_cm private data is odd - when there is any private data in the
+ * request, we will be given a pretty large buffer without being told the
+	 * original size. The only way to tell the difference is by looking at
+	 * the contents, which are initialized to zero.
+	 * If the protocol version fields aren't set, this is a connection attempt
+ * from an older version. This could be 3.0 or 2.0 - we can't tell.
+	 * We really should have changed this for OFED 1.3 :-( */
+	if (dp->dp_protocol_major == 0)
+		return RDS_PROTOCOL_3_0;
+
+	common = be16_to_cpu(dp->dp_protocol_minor_mask) & RDS_IB_SUPPORTED_PROTOCOLS;
+	if (dp->dp_protocol_major == 3 && common) {
+		version = RDS_PROTOCOL_3_0;
+		while ((common >>= 1) != 0)
+			version++;
+	} else if (printk_ratelimit()) {
+		printk(KERN_NOTICE "RDS: Connection from %u.%u.%u.%u using "
+			"incompatible protocol version %u.%u\n",
+			NIPQUAD(dp->dp_saddr),
+			dp->dp_protocol_major,
+			dp->dp_protocol_minor);
+	}
+	return version;
+}
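
Worked example of the negotiation above: if both ends advertise minor
versions 0 and 1 (mask 0x3), common is 0x3, the loop shifts once, and the
agreed version is RDS_PROTOCOL_3_0 plus one minor, i.e. 3.1. The same walk
in isolation (the RDS_PROTOCOL() encoding, where adding one bumps the
minor, is assumed from rds.h earlier in the series):

/* Isolated illustration of the highest-common-minor walk above. */
static u32 example_highest_common(u16 ours, u16 theirs, u32 base_version)
{
	u16 common = ours & theirs;
	u32 version = 0;

	if (common) {
		version = base_version;		/* bit 0 == minor 0 */
		while ((common >>= 1) != 0)
			version++;		/* one step per extra minor */
	}
	return version;	/* e.g. 0x3 & 0x3 -> base_version + 1 */
}
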
+
+static int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
+				    struct rdma_cm_event *event)
+{
+	__be64 lguid = cm_id->route.path_rec->sgid.global.interface_id;
+	__be64 fguid = cm_id->route.path_rec->dgid.global.interface_id;
+	const struct rds_ib_connect_private *dp = event->param.conn.private_data;
+	struct rds_ib_connect_private dp_rep;
+	struct rds_connection *conn = NULL;
+	struct rds_ib_connection *ic = NULL;
+	struct rdma_conn_param conn_param;
+	u32 version;
+	int err, destroy = 1;
+
+	/* Check whether the remote protocol version matches ours. */
+	version = rds_ib_protocol_compatible(dp);
+	if (!version)
+		goto out;
+
+	rdsdebug("saddr %u.%u.%u.%u daddr %u.%u.%u.%u RDSv%u.%u lguid 0x%llx fguid "
+		 "0x%llx\n", NIPQUAD(dp->dp_saddr), NIPQUAD(dp->dp_daddr),
+		 RDS_PROTOCOL_MAJOR(version), RDS_PROTOCOL_MINOR(version),
+		 (unsigned long long)be64_to_cpu(lguid),
+		 (unsigned long long)be64_to_cpu(fguid));
+
+	conn = rds_conn_create(dp->dp_daddr, dp->dp_saddr, &rds_ib_transport,
+			       GFP_KERNEL);
+	if (IS_ERR(conn)) {
+		rdsdebug("rds_conn_create failed (%ld)\n", PTR_ERR(conn));
+		conn = NULL;
+		goto out;
+	}
+	ic = conn->c_transport_data;
+
+	rds_ib_set_protocol(conn, version);
+	rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit));
+
+	/* If the peer gave us the last packet it saw, process this as if
+	 * we had received a regular ACK. */
+	if (dp->dp_ack_seq)
+		rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL);
+
+	/*
+	 * The connection request may occur while the
+	 * previous connection exists, e.g. in case of failover.
+	 * But as connections may be initiated simultaneously
+	 * by both hosts, we have a random backoff mechanism -
+	 * see the comment above rds_queue_reconnect()
+	 */
+	if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) {
+		if (rds_conn_state(conn) == RDS_CONN_UP) {
+			rdsdebug("incoming connect while connecting\n");
+			rds_conn_drop(conn);
+			rds_ib_stats_inc(s_ib_listen_closed_stale);
+		} else
+		if (rds_conn_state(conn) == RDS_CONN_CONNECTING) {
+			/* Wait and see - our connect may still be succeeding */
+			rds_ib_stats_inc(s_ib_connect_raced);
+		}
+		goto out;
+	}
+
+	BUG_ON(cm_id->context);
+	BUG_ON(ic->i_cm_id);
+
+	ic->i_cm_id = cm_id;
+	cm_id->context = conn;
+
+	/* We got halfway through setting up the ib_connection; if we
+	 * fail now, we have to take the long route out of this mess. */
+	destroy = 0;
+
+	err = rds_ib_setup_qp(conn);
+	if (err) {
+		rds_ib_conn_error(conn, "rds_ib_setup_qp failed (%d)\n", err);
+		goto out;
+	}
+
+	rds_ib_cm_fill_conn_param(conn, &conn_param, &dp_rep, version);
+
+	/* rdma_accept() calls rdma_reject() internally if it fails */
+	err = rdma_accept(cm_id, &conn_param);
+	if (err) {
+		rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err);
+		goto out;
+	}
+
+	return 0;
+
+out:
+	rdma_reject(cm_id, NULL, 0);
+	return destroy;
+}
+
+
+static int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id)
+{
+	struct rds_connection *conn = cm_id->context;
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct rdma_conn_param conn_param;
+	struct rds_ib_connect_private dp;
+	int ret;
+
+	/* If the peer doesn't do protocol negotiation, we must
+	 * default to RDSv3.0 */
+	rds_ib_set_protocol(conn, RDS_PROTOCOL_3_0);
+	ic->i_flowctl = rds_ib_sysctl_flow_control;	/* advertise flow control */
+
+	ret = rds_ib_setup_qp(conn);
+	if (ret) {
+		rds_ib_conn_error(conn, "rds_ib_setup_qp failed (%d)\n", ret);
+		goto out;
+	}
+
+	rds_ib_cm_fill_conn_param(conn, &conn_param, &dp, RDS_PROTOCOL_VERSION);
+
+	ret = rdma_connect(cm_id, &conn_param);
+	if (ret)
+		rds_ib_conn_error(conn, "rdma_connect failed (%d)\n", ret);
+
+out:
+	/* Beware - returning non-zero tells the rdma_cm to destroy
+	 * the cm_id. We should certainly not do it as long as we still
+	 * "own" the cm_id. */
+	if (ret) {
+		if (ic->i_cm_id == cm_id)
+			ret = 0;
+	}
+	return ret;
+}
+
+static int rds_ib_cm_event_handler(struct rdma_cm_id *cm_id,
+				   struct rdma_cm_event *event)
+{
+	/* this can be null in the listening path */
+	struct rds_connection *conn = cm_id->context;
+	int ret = 0;
+
+	rdsdebug("conn %p id %p handling event %u\n", conn, cm_id,
+		 event->event);
+
+	/* Prevent shutdown from tearing down the connection
+	 * while we're executing. */
+	if (conn) {
+		mutex_lock(&conn->c_cm_lock);
+
+		/* If the connection is being shut down, bail out
+		 * right away. We return 0 so cm_id doesn't get
+		 * destroyed prematurely */
+		if (rds_conn_state(conn) == RDS_CONN_DISCONNECTING) {
+			/* Reject incoming connections while we're tearing
+			 * down an existing one. */
+			if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST)
+				ret = 1;
+			goto out;
+		}
+	}
+
+	switch (event->event) {
+	case RDMA_CM_EVENT_CONNECT_REQUEST:
+		ret = rds_ib_cm_handle_connect(cm_id, event);
+		break;
+
+	case RDMA_CM_EVENT_ADDR_RESOLVED:
+		/* XXX do we need to clean up if this fails? */
+		ret = rdma_resolve_route(cm_id,
+					 RDS_IB_RESOLVE_TIMEOUT_MS);
+		break;
+
+	case RDMA_CM_EVENT_ROUTE_RESOLVED:
+		/* XXX worry about racing with listen acceptance */
+		ret = rds_ib_cm_initiate_connect(cm_id);
+		break;
+
+	case RDMA_CM_EVENT_ESTABLISHED:
+		rds_ib_connect_complete(conn, event);
+		break;
+
+	case RDMA_CM_EVENT_ADDR_ERROR:
+	case RDMA_CM_EVENT_ROUTE_ERROR:
+	case RDMA_CM_EVENT_CONNECT_ERROR:
+	case RDMA_CM_EVENT_UNREACHABLE:
+	case RDMA_CM_EVENT_REJECTED:
+	case RDMA_CM_EVENT_DEVICE_REMOVAL:
+	case RDMA_CM_EVENT_ADDR_CHANGE:
+		if (conn)
+			rds_conn_drop(conn);
+		break;
+
+	case RDMA_CM_EVENT_DISCONNECTED:
+		rds_conn_drop(conn);
+		break;
+
+	default:
+		/* things like device disconnect? */
+		printk(KERN_ERR "unknown event %u\n", event->event);
+		BUG();
+		break;
+	}
+
+out:
+	if (conn) {
+		struct rds_ib_connection *ic = conn->c_transport_data;
+
+		/* If we return non-zero, we must no longer own the cm_id */
+		BUG_ON(ic->i_cm_id == cm_id && ret);
+
+		mutex_unlock(&conn->c_cm_lock);
+	}
+
+	rdsdebug("id %p event %u handling ret %d\n", cm_id, event->event, ret);
+
+	return ret;
+}
+
+int rds_ib_conn_connect(struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct sockaddr_in src, dest;
+	int ret;
+
+	/* XXX I wonder what effect the port space has */
+	ic->i_cm_id = rdma_create_id(rds_ib_cm_event_handler, conn,
+				     RDMA_PS_TCP);
+	if (IS_ERR(ic->i_cm_id)) {
+		ret = PTR_ERR(ic->i_cm_id);
+		ic->i_cm_id = NULL;
+		rdsdebug("rdma_create_id() failed: %d\n", ret);
+		goto out;
+	}
+
+	rdsdebug("created cm id %p for conn %p\n", ic->i_cm_id, conn);
+
+	src.sin_family = AF_INET;
+	src.sin_addr.s_addr = (__force u32)conn->c_laddr;
+	src.sin_port = (__force u16)htons(0);
+
+	dest.sin_family = AF_INET;
+	dest.sin_addr.s_addr = (__force u32)conn->c_faddr;
+	dest.sin_port = (__force u16)htons(RDS_PORT);
+
+	ret = rdma_resolve_addr(ic->i_cm_id, (struct sockaddr *)&src,
+				(struct sockaddr *)&dest,
+				RDS_IB_RESOLVE_TIMEOUT_MS);
+	if (ret) {
+		rdsdebug("addr resolve failed for cm id %p: %d\n", ic->i_cm_id,
+			 ret);
+		rdma_destroy_id(ic->i_cm_id);
+		ic->i_cm_id = NULL;
+	}
+
+out:
+	return ret;
+}
+
+/*
+ * This is so careful about only cleaning up resources that were built up
+ * so that it can be called at any point during startup.  In fact it
+ * can be called multiple times for a given connection.
+ */
+void rds_ib_conn_shutdown(struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	int err = 0;
+
+	rdsdebug("cm %p pd %p cq %p %p qp %p\n", ic->i_cm_id,
+		 ic->i_pd, ic->i_send_cq, ic->i_recv_cq,
+		 ic->i_cm_id ? ic->i_cm_id->qp : NULL);
+
+	if (ic->i_cm_id) {
+		struct ib_device *dev = ic->i_cm_id->device;
+
+		rdsdebug("disconnecting cm %p\n", ic->i_cm_id);
+		err = rdma_disconnect(ic->i_cm_id);
+		if (err) {
+			/* Actually this may happen quite frequently, when
+			 * an outgoing connect raced with an incoming connect.
+			 */
+			rdsdebug("failed to disconnect, cm: %p err %d\n",
+				ic->i_cm_id, err);
+		}
+
+		wait_event(rds_ib_ring_empty_wait,
+			rds_ib_ring_empty(&ic->i_send_ring) &&
+			rds_ib_ring_empty(&ic->i_recv_ring));
+
+		if (ic->i_send_hdrs)
+			ib_dma_free_coherent(dev,
+					   ic->i_send_ring.w_nr *
+						sizeof(struct rds_header),
+					   ic->i_send_hdrs,
+					   ic->i_send_hdrs_dma);
+
+		if (ic->i_recv_hdrs)
+			ib_dma_free_coherent(dev,
+					   ic->i_recv_ring.w_nr *
+						sizeof(struct rds_header),
+					   ic->i_recv_hdrs,
+					   ic->i_recv_hdrs_dma);
+
+		if (ic->i_ack)
+			ib_dma_free_coherent(dev, sizeof(struct rds_header),
+					     ic->i_ack, ic->i_ack_dma);
+
+		if (ic->i_sends)
+			rds_ib_send_clear_ring(ic);
+		if (ic->i_recvs)
+			rds_ib_recv_clear_ring(ic);
+
+		if (ic->i_cm_id->qp)
+			rdma_destroy_qp(ic->i_cm_id);
+		if (ic->i_send_cq)
+			ib_destroy_cq(ic->i_send_cq);
+		if (ic->i_recv_cq)
+			ib_destroy_cq(ic->i_recv_cq);
+		rdma_destroy_id(ic->i_cm_id);
+
+		/*
+		 * Move connection back to the nodev list.
+		 */
+		if (ic->rds_ibdev) {
+
+			spin_lock_irq(&ic->rds_ibdev->spinlock);
+			BUG_ON(list_empty(&ic->ib_node));
+			list_del(&ic->ib_node);
+			spin_unlock_irq(&ic->rds_ibdev->spinlock);
+
+			spin_lock_irq(&ib_nodev_conns_lock);
+			list_add_tail(&ic->ib_node, &ib_nodev_conns);
+			spin_unlock_irq(&ib_nodev_conns_lock);
+			ic->rds_ibdev = NULL;
+		}
+
+		ic->i_cm_id = NULL;
+		ic->i_pd = NULL;
+		ic->i_mr = NULL;
+		ic->i_send_cq = NULL;
+		ic->i_recv_cq = NULL;
+		ic->i_send_hdrs = NULL;
+		ic->i_recv_hdrs = NULL;
+		ic->i_ack = NULL;
+	}
+	BUG_ON(ic->rds_ibdev);
+
+	/* Clear pending transmit */
+	if (ic->i_rm) {
+		rds_message_put(ic->i_rm);
+		ic->i_rm = NULL;
+	}
+
+	/* Clear the ACK state */
+	clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+#ifdef KERNEL_HAS_ATOMIC64
+	atomic64_set(&ic->i_ack_next, 0);
+#else
+	ic->i_ack_next = 0;
+#endif
+	ic->i_ack_recv = 0;
+
+	/* Clear flow control state */
+	ic->i_flowctl = 0;
+	atomic_set(&ic->i_credits, 0);
+
+	rds_ib_ring_init(&ic->i_send_ring, rds_ib_sysctl_max_send_wr);
+	rds_ib_ring_init(&ic->i_recv_ring, rds_ib_sysctl_max_recv_wr);
+
+	if (ic->i_ibinc) {
+		rds_inc_put(&ic->i_ibinc->ii_inc);
+		ic->i_ibinc = NULL;
+	}
+
+	vfree(ic->i_sends);
+	ic->i_sends = NULL;
+	vfree(ic->i_recvs);
+	ic->i_recvs = NULL;
+}
+
+int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp)
+{
+	struct rds_ib_connection *ic;
+	unsigned long flags;
+
+	/* XXX too lazy? */
+	ic = kzalloc(sizeof(struct rds_ib_connection), GFP_KERNEL);
+	if (ic == NULL)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&ic->ib_node);
+	mutex_init(&ic->i_recv_mutex);
+#ifndef KERNEL_HAS_ATOMIC64
+	spin_lock_init(&ic->i_ack_lock);
+#endif
+
+	/*
+	 * rds_ib_conn_shutdown() waits for these to be emptied so they
+	 * must be initialized before it can be called.
+	 */
+	rds_ib_ring_init(&ic->i_send_ring, rds_ib_sysctl_max_send_wr);
+	rds_ib_ring_init(&ic->i_recv_ring, rds_ib_sysctl_max_recv_wr);
+
+	ic->conn = conn;
+	conn->c_transport_data = ic;
+
+	spin_lock_irqsave(&ib_nodev_conns_lock, flags);
+	list_add_tail(&ic->ib_node, &ib_nodev_conns);
+	spin_unlock_irqrestore(&ib_nodev_conns_lock, flags);
+
+
+	rdsdebug("conn %p conn ic %p\n", conn, conn->c_transport_data);
+	return 0;
+}
+
+void rds_ib_conn_free(void *arg)
+{
+	struct rds_ib_connection *ic = arg;
+	rdsdebug("ic %p\n", ic);
+	list_del(&ic->ib_node);
+	kfree(ic);
+}
+
+int __init rds_ib_listen_init(void)
+{
+	struct sockaddr_in sin;
+	struct rdma_cm_id *cm_id;
+	int ret;
+
+	cm_id = rdma_create_id(rds_ib_cm_event_handler, NULL, RDMA_PS_TCP);
+	if (IS_ERR(cm_id)) {
+		ret = PTR_ERR(cm_id);
+		printk(KERN_ERR "RDS/ib: failed to setup listener, "
+		       "rdma_create_id() returned %d\n", ret);
+		goto out;
+	}
+
+	sin.sin_family = PF_INET;
+	sin.sin_addr.s_addr = (__force u32)htonl(INADDR_ANY);
+	sin.sin_port = (__force u16)htons(RDS_PORT);
+
+	/*
+	 * XXX I bet this binds the cm_id to a device.  If we want to support
+	 * fail-over we'll have to take this into consideration.
+	 */
+	ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin);
+	if (ret) {
+		printk(KERN_ERR "RDS/ib: failed to setup listener, "
+		       "rdma_bind_addr() returned %d\n", ret);
+		goto out;
+	}
+
+	ret = rdma_listen(cm_id, 128);
+	if (ret) {
+		printk(KERN_ERR "RDS/ib: failed to setup listener, "
+		       "rdma_listen() returned %d\n", ret);
+		goto out;
+	}
+
+	rdsdebug("cm %p listening on port %u\n", cm_id, RDS_PORT);
+
+	rds_ib_listen_id = cm_id;
+	cm_id = NULL;
+out:
+	if (cm_id)
+		rdma_destroy_id(cm_id);
+	return ret;
+}
+
+void rds_ib_listen_stop(void)
+{
+	if (rds_ib_listen_id) {
+		rdsdebug("cm %p\n", rds_ib_listen_id);
+		rdma_destroy_id(rds_ib_listen_id);
+		rds_ib_listen_id = NULL;
+	}
+}
+
+/*
+ * An error occurred on the connection
+ */
+void
+__rds_ib_conn_error(struct rds_connection *conn, const char *fmt, ...)
+{
+	va_list ap;
+
+	rds_conn_drop(conn);
+
+	va_start(ap, fmt);
+	vprintk(fmt, ap);
+	va_end(ap);
+}
-- 
1.5.6.3


* [PATCH 14/21] RDS/IB: Ring-handling code.
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (12 preceding siblings ...)
  2009-01-27  2:17 ` [ofa-general] [PATCH 13/21] RDS/IB: Infiniband transport Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [PATCH 15/21] RDS/IB: Implement RDMA ops using FMRs Andy Grover
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib_ring.c |  168 ++++++++++++++++++++++++++++++++++
 1 files changed, 168 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib_ring.c

diff --git a/drivers/infiniband/ulp/rds/ib_ring.c b/drivers/infiniband/ulp/rds/ib_ring.c
new file mode 100644
index 0000000..d23cc59
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_ring.c
@@ -0,0 +1,168 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+
+#include "rds.h"
+#include "ib.h"
+
+/*
+ * Locking for IB rings.
+ * We assume that allocation is always protected by a mutex
+ * in the caller (this is a valid assumption for the current
+ * implementation).
+ *
+ * Freeing always happens in an interrupt, and hence only
+ * races with allocations, but not with other free()s.
+ *
+ * The interaction between allocation and freeing is that
+ * the alloc code has to determine the number of free entries.
+ * To this end, we maintain two counters: an allocation counter
+ * and a free counter. Both are allowed to run freely, and wrap
+ * around.
+ * The number of used entries is always alloc_ctr - free_ctr
+ * (computed in modular u32 arithmetic; it never exceeds w_nr).
+ *
+ * The current implementation makes free_ctr atomic. When the
+ * caller finds an allocation fails, it should set an "alloc fail"
+ * bit and retry the allocation. The "alloc fail" bit essentially tells
+ * the CQ completion handlers to wake it up after freeing some
+ * more entries.
+ */
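
The free-running counters survive wrap-around because u32 subtraction is
modulo 2^32; only the ring positions need the explicit % w_nr. A quick
standalone demonstration of the counter arithmetic:

/* Demonstrates why free-running u32 counters survive wrap-around:
 * unsigned subtraction is modulo 2^32, so alloc - free is correct even
 * after either counter wraps. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t alloc_ctr = 5;		  /* wrapped past 2^32 */
	uint32_t free_ctr  = 0xfffffffeu; /* still below the wrap point */

	printf("used = %u\n", alloc_ctr - free_ctr); /* prints: used = 7 */
	return 0;
}
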
+
+/*
+ * This only happens on shutdown.
+ */
+DECLARE_WAIT_QUEUE_HEAD(rds_ib_ring_empty_wait);
+
+void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr)
+{
+	memset(ring, 0, sizeof(*ring));
+	ring->w_nr = nr;
+	rdsdebug("ring %p nr %u\n", ring, ring->w_nr);
+}
+
+static inline u32 __rds_ib_ring_used(struct rds_ib_work_ring *ring)
+{
+	u32 diff;
+
+	/* This assumes that atomic_t has at least as many bits as u32 */
+	diff = ring->w_alloc_ctr - (u32) atomic_read(&ring->w_free_ctr);
+	BUG_ON(diff > ring->w_nr);
+
+	return diff;
+}
+
+void rds_ib_ring_resize(struct rds_ib_work_ring *ring, u32 nr)
+{
+	/* We only ever get called from the connection setup code,
+	 * prior to creating the QP. */
+	BUG_ON(__rds_ib_ring_used(ring));
+	ring->w_nr = nr;
+}
+
+static int __rds_ib_ring_empty(struct rds_ib_work_ring *ring)
+{
+	return __rds_ib_ring_used(ring) == 0;
+}
+
+u32 rds_ib_ring_alloc(struct rds_ib_work_ring *ring, u32 val, u32 *pos)
+{
+	u32 ret = 0, avail;
+
+	avail = ring->w_nr - __rds_ib_ring_used(ring);
+
+	rdsdebug("ring %p val %u next %u free %u\n", ring, val,
+		 ring->w_alloc_ptr, avail);
+
+	if (val && avail) {
+		ret = min(val, avail);
+		*pos = ring->w_alloc_ptr;
+
+		ring->w_alloc_ptr = (ring->w_alloc_ptr + ret) % ring->w_nr;
+		ring->w_alloc_ctr += ret;
+	}
+
+	return ret;
+}
+
+void rds_ib_ring_free(struct rds_ib_work_ring *ring, u32 val)
+{
+	ring->w_free_ptr = (ring->w_free_ptr + val) % ring->w_nr;
+	atomic_add(val, &ring->w_free_ctr);
+
+	if (__rds_ib_ring_empty(ring) &&
+	    waitqueue_active(&rds_ib_ring_empty_wait))
+		wake_up(&rds_ib_ring_empty_wait);
+}
+
+void rds_ib_ring_unalloc(struct rds_ib_work_ring *ring, u32 val)
+{
+	ring->w_alloc_ptr = (ring->w_alloc_ptr + ring->w_nr - val) % ring->w_nr;
+	ring->w_alloc_ctr -= val;
+}
+
+int rds_ib_ring_empty(struct rds_ib_work_ring *ring)
+{
+	return __rds_ib_ring_empty(ring);
+}
+
+int rds_ib_ring_low(struct rds_ib_work_ring *ring)
+{
+	return __rds_ib_ring_used(ring) <= (ring->w_nr >> 2);
+}
+
+/*
+ * returns the oldest alloced ring entry.  This will be the next one
+ * freed.  This can't be called if there are none allocated.
+ */
+u32 rds_ib_ring_oldest(struct rds_ib_work_ring *ring)
+{
+	return ring->w_free_ptr;
+}
+
+/*
+ * returns the number of completed work requests.
+ */
+
+u32 rds_ib_ring_completed(struct rds_ib_work_ring *ring, u32 wr_id, u32 oldest)
+{
+	u32 ret;
+
+	if (oldest <= wr_id)
+		ret = wr_id - oldest + 1;
+	else
+		ret = ring->w_nr - oldest + wr_id + 1;
+
+	rdsdebug("ring %p ret %u wr_id %u oldest %u\n", ring, ret,
+		 wr_id, oldest);
+	return ret;
+}
-- 
1.5.6.3



* [PATCH 15/21] RDS/IB: Implement RDMA ops using FMRs
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (13 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 14/21] RDS/IB: Ring-handling code Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [PATCH 16/21] RDS/IB: Implement IB-specific datagram send Andy Grover
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib_rdma.c |  641 ++++++++++++++++++++++++++++++++++
 1 files changed, 641 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib_rdma.c

diff --git a/drivers/infiniband/ulp/rds/ib_rdma.c b/drivers/infiniband/ulp/rds/ib_rdma.c
new file mode 100644
index 0000000..69a6289
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_rdma.c
@@ -0,0 +1,641 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+
+#include "rds.h"
+#include "rdma.h"
+#include "ib.h"
+
+
+/*
+ * This is stored as mr->r_trans_private.
+ */
+struct rds_ib_mr {
+	struct rds_ib_device	*device;
+	struct rds_ib_mr_pool	*pool;
+	struct ib_fmr		*fmr;
+	struct list_head	list;
+	unsigned int		remap_count;
+
+	struct scatterlist	*sg;
+	unsigned int		sg_len;
+	u64			*dma;
+	int			sg_dma_len;
+};
+
+/*
+ * Our own little FMR pool
+ */
+struct rds_ib_mr_pool {
+	struct mutex		flush_lock;		/* serialize fmr invalidate */
+	struct work_struct	flush_worker;		/* flush worker */
+
+	spinlock_t		list_lock;		/* protect variables below */
+	atomic_t		item_count;		/* total # of MRs */
+	atomic_t		dirty_count;		/* # of dirty MRs */
+	struct list_head	drop_list;		/* MRs that have reached their max_maps limit */
+	struct list_head	free_list;		/* unused MRs */
+	struct list_head	clean_list;		/* unused & unmapped MRs */
+	atomic_t		free_pinned;		/* memory pinned by free MRs */
+	unsigned long		max_items;
+	unsigned long		max_items_soft;
+	unsigned long		max_free_pinned;
+	struct ib_fmr_attr	fmr_attr;
+};
+
+static int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all);
+static void rds_ib_teardown_mr(struct rds_ib_mr *ibmr);
+static void rds_ib_mr_pool_flush_worker(struct work_struct *work);
+
+static struct rds_ib_device *rds_ib_get_device(__be32 ipaddr)
+{
+	struct rds_ib_device *rds_ibdev;
+	struct rds_ib_ipaddr *i_ipaddr;
+
+	list_for_each_entry(rds_ibdev, &rds_ib_devices, list) {
+		spin_lock_irq(&rds_ibdev->spinlock);
+		list_for_each_entry(i_ipaddr, &rds_ibdev->ipaddr_list, list) {
+			if (i_ipaddr->ipaddr == ipaddr) {
+				spin_unlock_irq(&rds_ibdev->spinlock);
+				return rds_ibdev;
+			}
+		}
+		spin_unlock_irq(&rds_ibdev->spinlock);
+	}
+
+	return NULL;
+}
+
+static int rds_ib_add_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
+{
+	struct rds_ib_ipaddr *i_ipaddr;
+
+	i_ipaddr = kmalloc(sizeof *i_ipaddr, GFP_KERNEL);
+	if (!i_ipaddr)
+		return -ENOMEM;
+
+	i_ipaddr->ipaddr = ipaddr;
+
+	spin_lock_irq(&rds_ibdev->spinlock);
+	list_add_tail(&i_ipaddr->list, &rds_ibdev->ipaddr_list);
+	spin_unlock_irq(&rds_ibdev->spinlock);
+
+	return 0;
+}
+
+static void rds_ib_remove_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
+{
+	struct rds_ib_ipaddr *i_ipaddr, *next;
+
+	spin_lock_irq(&rds_ibdev->spinlock);
+	list_for_each_entry_safe(i_ipaddr, next, &rds_ibdev->ipaddr_list, list) {
+		if (i_ipaddr->ipaddr == ipaddr) {
+			list_del(&i_ipaddr->list);
+			kfree(i_ipaddr);
+			break;
+		}
+	}
+	spin_unlock_irq(&rds_ibdev->spinlock);
+}
+
+int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
+{
+	struct rds_ib_device *rds_ibdev_old;
+
+	rds_ibdev_old = rds_ib_get_device(ipaddr);
+	if (rds_ibdev_old)
+		rds_ib_remove_ipaddr(rds_ibdev_old, ipaddr);
+
+	return rds_ib_add_ipaddr(rds_ibdev, ipaddr);
+}
+
+int rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	/* conn was previously on the nodev_conns_list */
+	spin_lock_irq(&ib_nodev_conns_lock);
+	BUG_ON(list_empty(&ib_nodev_conns));
+	BUG_ON(list_empty(&ic->ib_node));
+	list_del(&ic->ib_node);
+	spin_unlock_irq(&ib_nodev_conns_lock);
+
+	spin_lock_irq(&rds_ibdev->spinlock);
+	list_add_tail(&ic->ib_node, &rds_ibdev->conn_list);
+	spin_unlock_irq(&rds_ibdev->spinlock);
+
+	ic->rds_ibdev = rds_ibdev;
+
+	return 0;
+}
+
+void rds_ib_remove_nodev_conns(void)
+{
+	struct rds_ib_connection *ic, *_ic;
+	LIST_HEAD(tmp_list);
+
+	/* avoid calling conn_destroy with irqs off */
+	spin_lock_irq(&ib_nodev_conns_lock);
+	list_splice(&ib_nodev_conns, &tmp_list);
+	INIT_LIST_HEAD(&ib_nodev_conns);
+	spin_unlock_irq(&ib_nodev_conns_lock);
+
+	list_for_each_entry_safe(ic, _ic, &tmp_list, ib_node) {
+		if (ic->conn->c_passive)
+			rds_conn_destroy(ic->conn->c_passive);
+		rds_conn_destroy(ic->conn);
+	}
+}
+
+void rds_ib_remove_conns(struct rds_ib_device *rds_ibdev)
+{
+	struct rds_ib_connection *ic, *_ic;
+	LIST_HEAD(tmp_list);
+
+	/* avoid calling conn_destroy with irqs off */
+	spin_lock_irq(&rds_ibdev->spinlock);
+	list_splice(&rds_ibdev->conn_list, &tmp_list);
+	INIT_LIST_HEAD(&rds_ibdev->conn_list);
+	spin_unlock_irq(&rds_ibdev->spinlock);
+
+	list_for_each_entry_safe(ic, _ic, &tmp_list, ib_node) {
+		if (ic->conn->c_passive)
+			rds_conn_destroy(ic->conn->c_passive);
+		rds_conn_destroy(ic->conn);
+	}
+}
+
+struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev)
+{
+	struct rds_ib_mr_pool *pool;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&pool->free_list);
+	INIT_LIST_HEAD(&pool->drop_list);
+	INIT_LIST_HEAD(&pool->clean_list);
+	mutex_init(&pool->flush_lock);
+	spin_lock_init(&pool->list_lock);
+	INIT_WORK(&pool->flush_worker, rds_ib_mr_pool_flush_worker);
+
+	pool->fmr_attr.max_pages = fmr_message_size;
+	pool->fmr_attr.max_maps = rds_ibdev->fmr_max_remaps;
+	pool->fmr_attr.page_shift = rds_ibdev->fmr_page_shift;
+	pool->max_free_pinned = rds_ibdev->max_fmrs * fmr_message_size / 4;
+
+	/* We never allow more than max_items MRs to be allocated.
+	 * When we exceed max_items_soft, we start freeing
+	 * items more aggressively.
+	 * Make sure that max_items > max_items_soft > max_items / 2
+	 */
+	pool->max_items_soft = rds_ibdev->max_fmrs * 3 / 4;
+	pool->max_items = rds_ibdev->max_fmrs;
+
+	return pool;
+}
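+
+/* A worked example of the sizing above (illustrative values, not
+ * mandated anywhere): with max_fmrs = 512 and fmr_message_size = 256
+ * pages, the pool caps out at 512 MRs, starts freeing aggressively
+ * past 384 (512 * 3 / 4), and asks for a flush once free-but-pinned
+ * memory exceeds 32768 pages (512 * 256 / 4).  This satisfies the
+ * max_items > max_items_soft > max_items / 2 invariant noted above.
+ */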
+
+void rds_ib_get_mr_info(struct rds_ib_device *rds_ibdev, struct rds_info_rdma_connection *iinfo)
+{
+	struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
+
+	iinfo->rdma_mr_max = pool->max_items;
+	iinfo->rdma_mr_size = pool->fmr_attr.max_pages;
+}
+
+void rds_ib_destroy_mr_pool(struct rds_ib_mr_pool *pool)
+{
+	flush_workqueue(rds_wq);
+	rds_ib_flush_mr_pool(pool, 1);
+	BUG_ON(atomic_read(&pool->item_count));
+	BUG_ON(atomic_read(&pool->free_pinned));
+	kfree(pool);
+}
+
+static inline struct rds_ib_mr *rds_ib_reuse_fmr(struct rds_ib_mr_pool *pool)
+{
+	struct rds_ib_mr *ibmr = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pool->list_lock, flags);
+	if (!list_empty(&pool->clean_list)) {
+		ibmr = list_entry(pool->clean_list.next, struct rds_ib_mr, list);
+		list_del_init(&ibmr->list);
+	}
+	spin_unlock_irqrestore(&pool->list_lock, flags);
+
+	return ibmr;
+}
+
+static struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev)
+{
+	struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
+	struct rds_ib_mr *ibmr = NULL;
+	int err = 0, iter = 0;
+
+	while (1) {
+		ibmr = rds_ib_reuse_fmr(pool);
+		if (ibmr)
+			return ibmr;
+
+		/* No clean MRs - now we have the choice of either
+		 * allocating a fresh MR up to the limit imposed by the
+		 * driver, or flushing any dirty unused MRs.
+		 * We try to avoid stalling in the send path if possible,
+		 * so we allocate as long as we're allowed to.
+		 *
+		 * We're strict about enforcing the FMR limit, though. If the
+		 * driver tells us we can't use more than N fmrs, we shouldn't
+		 * start arguing with it */
+		if (atomic_inc_return(&pool->item_count) <= pool->max_items)
+			break;
+
+		atomic_dec(&pool->item_count);
+
+		if (++iter > 2) {
+			rds_ib_stats_inc(s_ib_rdma_mr_pool_depleted);
+			return ERR_PTR(-EAGAIN);
+		}
+
+		/* We do have some empty MRs. Flush them out. */
+		rds_ib_stats_inc(s_ib_rdma_mr_pool_wait);
+		rds_ib_flush_mr_pool(pool, 0);
+	}
+
+	ibmr = kzalloc(sizeof(*ibmr), GFP_KERNEL);
+	if (!ibmr) {
+		err = -ENOMEM;
+		goto out_no_cigar;
+	}
+
+	ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
+			(IB_ACCESS_LOCAL_WRITE |
+			 IB_ACCESS_REMOTE_READ |
+			 IB_ACCESS_REMOTE_WRITE),
+			&pool->fmr_attr);
+	if (IS_ERR(ibmr->fmr)) {
+		err = PTR_ERR(ibmr->fmr);
+		ibmr->fmr = NULL;
+		printk(KERN_WARNING "RDS/IB: ib_alloc_fmr failed (err=%d)\n", err);
+		goto out_no_cigar;
+	}
+
+	rds_ib_stats_inc(s_ib_rdma_mr_alloc);
+	return ibmr;
+
+out_no_cigar:
+	if (ibmr) {
+		if (ibmr->fmr)
+			ib_dealloc_fmr(ibmr->fmr);
+		kfree(ibmr);
+	}
+	atomic_dec(&pool->item_count);
+	return ERR_PTR(err);
+}
+
+static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
+	       struct scatterlist *sg, unsigned int nents)
+{
+	struct ib_device *dev = rds_ibdev->dev;
+	struct scatterlist *scat = sg;
+	u64 io_addr = 0;
+	u64 *dma_pages;
+	u32 len;
+	int page_cnt, sg_dma_len;
+	int i, j;
+	int ret;
+
+	sg_dma_len = ib_dma_map_sg(dev, sg, nents,
+				 DMA_BIDIRECTIONAL);
+	if (unlikely(!sg_dma_len)) {
+		printk(KERN_WARNING "RDS/IB: dma_map_sg failed!\n");
+		return -EBUSY;
+	}
+
+	len = 0;
+	page_cnt = 0;
+
+	for (i = 0; i < sg_dma_len; ++i) {
+		unsigned int dma_len = ib_sg_dma_len(dev, &scat[i]);
+		u64 dma_addr = ib_sg_dma_address(dev, &scat[i]);
+
+		if (dma_addr & ~rds_ibdev->fmr_page_mask) {
+			if (i > 0)
+				return -EINVAL;
+			else
+				++page_cnt;
+		}
+		if ((dma_addr + dma_len) & ~rds_ibdev->fmr_page_mask) {
+			if (i < sg_dma_len - 1)
+				return -EINVAL;
+			else
+				++page_cnt;
+		}
+
+		len += dma_len;
+	}
+
+	page_cnt += len >> rds_ibdev->fmr_page_shift;
+	if (page_cnt > fmr_message_size)
+		return -EINVAL;
+
+	dma_pages = kmalloc(sizeof(u64) * page_cnt, GFP_ATOMIC);
+	if (!dma_pages)
+		return -ENOMEM;
+
+	page_cnt = 0;
+	for (i = 0; i < sg_dma_len; ++i) {
+		unsigned int dma_len = ib_sg_dma_len(dev, &scat[i]);
+		u64 dma_addr = ib_sg_dma_address(dev, &scat[i]);
+
+		for (j = 0; j < dma_len; j += rds_ibdev->fmr_page_size)
+			dma_pages[page_cnt++] =
+				(dma_addr & rds_ibdev->fmr_page_mask) + j;
+	}
+
+	ret = ib_map_phys_fmr(ibmr->fmr,
+				   dma_pages, page_cnt, io_addr);
+	if (ret)
+		goto out;
+
+	/* Success - the MR was remapped, so we can safely tear
+	 * down the old mapping. */
+	rds_ib_teardown_mr(ibmr);
+
+	ibmr->sg = scat;
+	ibmr->sg_len = nents;
+	ibmr->sg_dma_len = sg_dma_len;
+	ibmr->remap_count++;
+
+	rds_ib_stats_inc(s_ib_rdma_mr_used);
+	ret = 0;
+
+out:
+	kfree(dma_pages);
+
+	return ret;
+}
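+
+/* To illustrate the alignment rules enforced above (a hypothetical
+ * layout, assuming a 4KB FMR page size): a three-entry scatterlist
+ * of [0x1200, 0x2000), [0x2000, 0x5000), [0x5000, 0x5800) maps
+ * fine - only the first entry may begin, and only the last may end,
+ * off a page boundary.  Any other arrangement of those entries
+ * fails with -EINVAL, since the FMR can only describe a single
+ * contiguous run of whole pages with ragged edges at the extremes.
+ */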
+
+void rds_ib_sync_mr(void *trans_private, int direction)
+{
+	struct rds_ib_mr *ibmr = trans_private;
+	struct rds_ib_device *rds_ibdev = ibmr->device;
+
+	switch (direction) {
+	case DMA_FROM_DEVICE:
+		ib_dma_sync_sg_for_cpu(rds_ibdev->dev, ibmr->sg,
+			ibmr->sg_dma_len, DMA_BIDIRECTIONAL);
+		break;
+	case DMA_TO_DEVICE:
+		ib_dma_sync_sg_for_device(rds_ibdev->dev, ibmr->sg,
+			ibmr->sg_dma_len, DMA_BIDIRECTIONAL);
+		break;
+	}
+}
+
+static void __rds_ib_teardown_mr(struct rds_ib_mr *ibmr)
+{
+	struct rds_ib_device *rds_ibdev = ibmr->device;
+
+	if (ibmr->sg_dma_len) {
+		ib_dma_unmap_sg(rds_ibdev->dev,
+				ibmr->sg, ibmr->sg_len,
+				DMA_BIDIRECTIONAL);
+		ibmr->sg_dma_len = 0;
+	}
+
+	/* Release the s/g list */
+	if (ibmr->sg_len) {
+		unsigned int i;
+
+		for (i = 0; i < ibmr->sg_len; ++i) {
+			struct page *page = sg_page(&ibmr->sg[i]);
+
+			/* FIXME we need a way to tell a r/w MR
+			 * from a r/o MR */
+			set_page_dirty(page);
+			put_page(page);
+		}
+		kfree(ibmr->sg);
+
+		ibmr->sg = NULL;
+		ibmr->sg_len = 0;
+	}
+}
+
+static void rds_ib_teardown_mr(struct rds_ib_mr *ibmr)
+{
+	unsigned int pinned = ibmr->sg_len;
+
+	__rds_ib_teardown_mr(ibmr);
+	if (pinned) {
+		struct rds_ib_device *rds_ibdev = ibmr->device;
+		struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
+
+		atomic_sub(pinned, &pool->free_pinned);
+	}
+}
+
+static inline unsigned int rds_ib_flush_goal(struct rds_ib_mr_pool *pool, int free_all)
+{
+	unsigned int item_count;
+
+	item_count = atomic_read(&pool->item_count);
+	if (free_all)
+		return item_count;
+
+	return 0;
+}
+
+/*
+ * Flush our pool of MRs.
+ * At a minimum, all currently unused MRs are unmapped.
+ * If the number of MRs allocated exceeds the limit, we also try
+ * to free as many MRs as needed to get back to this limit.
+ */
+static int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all)
+{
+	struct rds_ib_mr *ibmr, *next;
+	LIST_HEAD(unmap_list);
+	LIST_HEAD(fmr_list);
+	unsigned long unpinned = 0;
+	unsigned long flags;
+	unsigned int nfreed = 0, ncleaned = 0, free_goal;
+	int ret = 0;
+
+	rds_ib_stats_inc(s_ib_rdma_mr_pool_flush);
+
+	mutex_lock(&pool->flush_lock);
+
+	spin_lock_irqsave(&pool->list_lock, flags);
+	/* Get the list of all MRs to be dropped. Ordering matters -
+	 * we want to put drop_list ahead of free_list. */
+	list_splice_init(&pool->free_list, &unmap_list);
+	list_splice_init(&pool->drop_list, &unmap_list);
+	if (free_all)
+		list_splice_init(&pool->clean_list, &unmap_list);
+	spin_unlock_irqrestore(&pool->list_lock, flags);
+
+	free_goal = rds_ib_flush_goal(pool, free_all);
+
+	if (list_empty(&unmap_list))
+		goto out;
+
+	/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
+	list_for_each_entry(ibmr, &unmap_list, list)
+		list_add(&ibmr->fmr->list, &fmr_list);
+	ret = ib_unmap_fmr(&fmr_list);
+	if (ret)
+		printk(KERN_WARNING "RDS/IB: ib_unmap_fmr failed (err=%d)\n", ret);
+
+	/* Now we can destroy the DMA mapping and unpin any pages */
+	list_for_each_entry_safe(ibmr, next, &unmap_list, list) {
+		unpinned += ibmr->sg_len;
+		__rds_ib_teardown_mr(ibmr);
+		if (nfreed < free_goal || ibmr->remap_count >= pool->fmr_attr.max_maps) {
+			rds_ib_stats_inc(s_ib_rdma_mr_free);
+			list_del(&ibmr->list);
+			ib_dealloc_fmr(ibmr->fmr);
+			kfree(ibmr);
+			nfreed++;
+		}
+		ncleaned++;
+	}
+
+	spin_lock_irqsave(&pool->list_lock, flags);
+	list_splice(&unmap_list, &pool->clean_list);
+	spin_unlock_irqrestore(&pool->list_lock, flags);
+
+	atomic_sub(unpinned, &pool->free_pinned);
+	atomic_sub(ncleaned, &pool->dirty_count);
+	atomic_sub(nfreed, &pool->item_count);
+
+out:
+	mutex_unlock(&pool->flush_lock);
+	return ret;
+}
+
+static void rds_ib_mr_pool_flush_worker(struct work_struct *work)
+{
+	struct rds_ib_mr_pool *pool = container_of(work, struct rds_ib_mr_pool, flush_worker);
+
+	rds_ib_flush_mr_pool(pool, 0);
+}
+
+void rds_ib_free_mr(void *trans_private, int invalidate)
+{
+	struct rds_ib_mr *ibmr = trans_private;
+	struct rds_ib_device *rds_ibdev = ibmr->device;
+	struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
+	unsigned long flags;
+
+	rdsdebug("RDS/IB: free_mr nents %u\n", ibmr->sg_len);
+
+	/* Return it to the pool's free list */
+	spin_lock_irqsave(&pool->list_lock, flags);
+	if (ibmr->remap_count >= pool->fmr_attr.max_maps)
+		list_add(&ibmr->list, &pool->drop_list);
+	else
+		list_add(&ibmr->list, &pool->free_list);
+
+	atomic_add(ibmr->sg_len, &pool->free_pinned);
+	atomic_inc(&pool->dirty_count);
+	spin_unlock_irqrestore(&pool->list_lock, flags);
+
+	/* If we've pinned too many pages, request a flush */
+	if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned
+	 || atomic_read(&pool->dirty_count) >= pool->max_items / 10)
+		queue_work(rds_wq, &pool->flush_worker);
+
+	if (invalidate) {
+		if (likely(!in_interrupt())) {
+			rds_ib_flush_mr_pool(pool, 0);
+		} else {
+			/* We get here if the user created a MR marked
+			 * as use_once and invalidate at the same time. */
+			queue_work(rds_wq, &pool->flush_worker);
+		}
+	}
+}
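+
+/* Sketch of the MR life cycle implied by the functions above:
+ *
+ *	rds_ib_get_mr()        MR mapped and in use by a socket
+ *	rds_ib_free_mr()       MR parked on free_list, or on drop_list
+ *	                       once remap_count hits max_maps
+ *	rds_ib_flush_mr_pool() unmaps the dirty lists; survivors move
+ *	                       to clean_list, the rest are deallocated
+ *	rds_ib_reuse_fmr()     hands a clean_list MR to the next get_mr
+ */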
+
+void rds_ib_flush_mrs(void)
+{
+	struct rds_ib_device *rds_ibdev;
+
+	list_for_each_entry(rds_ibdev, &rds_ib_devices, list) {
+		struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
+
+		if (pool)
+			rds_ib_flush_mr_pool(pool, 0);
+	}
+}
+
+void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
+		    struct rds_sock *rs, u32 *key_ret)
+{
+	struct rds_ib_device *rds_ibdev;
+	struct rds_ib_mr *ibmr = NULL;
+	int ret;
+
+	rds_ibdev = rds_ib_get_device(rs->rs_bound_addr);
+	if (!rds_ibdev) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	if (!rds_ibdev->mr_pool) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	ibmr = rds_ib_alloc_fmr(rds_ibdev);
+	if (IS_ERR(ibmr))
+		return ibmr;
+
+	ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents);
+	if (ret == 0)
+		*key_ret = ibmr->fmr->rkey;
+	else
+		printk(KERN_WARNING "RDS/IB: map_fmr failed (errno=%d)\n", ret);
+
+	ibmr->device = rds_ibdev;
+
+ out:
+	if (ret) {
+		if (ibmr)
+			rds_ib_free_mr(ibmr, 0);
+		ibmr = ERR_PTR(ret);
+	}
+	return ibmr;
+}
-- 
1.5.6.3



* [PATCH 16/21] RDS/IB: Implement IB-specific datagram send.
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (14 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 15/21] RDS/IB: Implement RDMA ops using FMRs Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [PATCH 17/21] RDS/IB: Receive datagrams via IB Andy Grover
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

In addition to the expected use of the IB API to package outgoing
data into work requests, this implements the credit-based flow
control mechanism that is specific to IB.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib_send.c |  852 ++++++++++++++++++++++++++++++++++
 1 files changed, 852 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib_send.c

diff --git a/drivers/infiniband/ulp/rds/ib_send.c b/drivers/infiniband/ulp/rds/ib_send.c
new file mode 100644
index 0000000..20af976
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_send.c
@@ -0,0 +1,852 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/in.h>
+#include <linux/device.h>
+#include <linux/dmapool.h>
+
+#include "rds.h"
+#include "rdma.h"
+#include "ib.h"
+
+static void rds_ib_send_rdma_complete(struct rds_message *rm,
+				      int wc_status)
+{
+	int notify_status;
+
+	switch (wc_status) {
+	case IB_WC_WR_FLUSH_ERR:
+		return;
+
+	case IB_WC_SUCCESS:
+		notify_status = RDS_RDMA_SUCCESS;
+		break;
+
+	case IB_WC_REM_ACCESS_ERR:
+		notify_status = RDS_RDMA_REMOTE_ERROR;
+		break;
+
+	default:
+		notify_status = RDS_RDMA_OTHER_ERROR;
+		break;
+	}
+	rds_rdma_send_complete(rm, notify_status);
+}
+
+static void rds_ib_send_unmap_rdma(struct rds_ib_connection *ic,
+				   struct rds_rdma_op *op)
+{
+	if (op->r_mapped) {
+		ib_dma_unmap_sg(ic->i_cm_id->device,
+			op->r_sg, op->r_nents,
+			op->r_write ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
+		op->r_mapped = 0;
+	}
+}
+
+static void rds_ib_send_unmap_rm(struct rds_ib_connection *ic,
+			  struct rds_ib_send_work *send,
+			  int wc_status)
+{
+	struct rds_message *rm = send->s_rm;
+
+	rdsdebug("ic %p send %p rm %p\n", ic, send, rm);
+
+	ib_dma_unmap_sg(ic->i_cm_id->device,
+		     rm->m_sg, rm->m_nents,
+		     DMA_TO_DEVICE);
+
+	if (rm->m_rdma_op != NULL) {
+		rds_ib_send_unmap_rdma(ic, rm->m_rdma_op);
+
+		/* If the user asked for a completion notification on this
+		 * message, we can implement three different semantics:
+		 *  1.	Notify when we received the ACK on the RDS message
+		 *	that was queued with the RDMA. This provides reliable
+		 *	notification of RDMA status at the expense of a one-way
+		 *	packet delay.
+		 *  2.	Notify when the IB stack gives us the completion event for
+		 *	the RDMA operation.
+		 *  3.	Notify when the IB stack gives us the completion event for
+		 *	the accompanying RDS messages.
+		 * Here, we implement approach #3. To implement approach #2,
+		 * call rds_rdma_send_complete from the cq_handler. To implement #1,
+		 * don't call rds_rdma_send_complete at all, and fall back to the notify
+		 * handling in the ACK processing code.
+		 *
+		 * Note: There's no need to explicitly sync any RDMA buffers using
+		 * ib_dma_sync_sg_for_cpu - the completion for the RDMA
+		 * operation itself unmapped the RDMA buffers, which takes care
+		 * of synching.
+		 */
+		rds_ib_send_rdma_complete(rm, wc_status);
+
+		if (rm->m_rdma_op->r_write)
+			rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes);
+		else
+			rds_stats_add(s_recv_rdma_bytes, rm->m_rdma_op->r_bytes);
+	}
+
+	/* If anyone waited for this message to get flushed out, wake
+	 * them up now */
+	rds_message_unmapped(rm);
+
+	rds_message_put(rm);
+	send->s_rm = NULL;
+}
+
+void rds_ib_send_init_ring(struct rds_ib_connection *ic)
+{
+	struct rds_ib_send_work *send;
+	u32 i;
+
+	for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) {
+		struct ib_sge *sge;
+
+		send->s_rm = NULL;
+		send->s_op = NULL;
+
+		send->s_wr.wr_id = i;
+		send->s_wr.sg_list = send->s_sge;
+		send->s_wr.num_sge = 1;
+		send->s_wr.opcode = IB_WR_SEND;
+		send->s_wr.send_flags = 0;
+		send->s_wr.ex.imm_data = 0;
+
+		sge = rds_ib_data_sge(ic, send->s_sge);
+		sge->lkey = ic->i_mr->lkey;
+
+		sge = rds_ib_header_sge(ic, send->s_sge);
+		sge->addr = ic->i_send_hdrs_dma + (i * sizeof(struct rds_header));
+		sge->length = sizeof(struct rds_header);
+		sge->lkey = ic->i_mr->lkey;
+	}
+}
+
+void rds_ib_send_clear_ring(struct rds_ib_connection *ic)
+{
+	struct rds_ib_send_work *send;
+	u32 i;
+
+	for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) {
+		if (send->s_wr.opcode == 0xdead)
+			continue;
+		if (send->s_rm)
+			rds_ib_send_unmap_rm(ic, send, IB_WC_WR_FLUSH_ERR);
+		if (send->s_op)
+			rds_ib_send_unmap_rdma(ic, send->s_op);
+	}
+}
+
+/*
+ * The _oldest/_free ring operations here race cleanly with the alloc/unalloc
+ * operations performed in the send path.  As the sender allocs and potentially
+ * unallocs the next free entry in the ring, it doesn't alter which entry is
+ * the next to be freed, which is all this code is concerned with.
+ */
+void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context)
+{
+	struct rds_connection *conn = context;
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct ib_wc wc;
+	struct rds_ib_send_work *send;
+	u32 completed;
+	u32 oldest;
+	u32 i = 0;
+	int ret;
+
+	rdsdebug("cq %p conn %p\n", cq, conn);
+	rds_ib_stats_inc(s_ib_tx_cq_call);
+	ret = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
+	if (ret)
+		rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
+
+	while (ib_poll_cq(cq, 1, &wc) > 0) {
+		rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n",
+			 (unsigned long long)wc.wr_id, wc.status, wc.byte_len,
+			 be32_to_cpu(wc.ex.imm_data));
+		rds_ib_stats_inc(s_ib_tx_cq_event);
+
+		if (wc.wr_id == RDS_IB_ACK_WR_ID) {
+			if (ic->i_ack_queued + HZ/2 < jiffies)
+				rds_ib_stats_inc(s_ib_tx_stalled);
+			rds_ib_ack_send_complete(ic);
+			continue;
+		}
+
+		oldest = rds_ib_ring_oldest(&ic->i_send_ring);
+
+		completed = rds_ib_ring_completed(&ic->i_send_ring, wc.wr_id, oldest);
+
+		for (i = 0; i < completed; i++) {
+			send = &ic->i_sends[oldest];
+
+			/* In the error case, wc.opcode sometimes contains garbage */
+			switch (send->s_wr.opcode) {
+			case IB_WR_SEND:
+				if (send->s_rm)
+					rds_ib_send_unmap_rm(ic, send, wc.status);
+				break;
+			case IB_WR_RDMA_WRITE:
+			case IB_WR_RDMA_READ:
+				/* Nothing to be done - the SG list will be unmapped
+				 * when the SEND completes. */
+				break;
+			default:
+				if (printk_ratelimit())
+					printk(KERN_NOTICE
+						"RDS/IB: %s: unexpected opcode 0x%x in WR!\n",
+						__func__, send->s_wr.opcode);
+				break;
+			}
+
+			send->s_wr.opcode = 0xdead;
+			send->s_wr.num_sge = 1;
+			if (send->s_queued + HZ/2 < jiffies)
+				rds_ib_stats_inc(s_ib_tx_stalled);
+
+			/* If an RDMA operation produced an error, signal this right
+			 * away. If we don't, the subsequent SEND that goes with this
+			 * RDMA will be canceled with ERR_WFLUSH, and the application
+			 * will never learn that the RDMA failed. */
+			if (unlikely(wc.status == IB_WC_REM_ACCESS_ERR && send->s_op)) {
+				struct rds_message *rm;
+
+				rm = rds_send_get_message(conn, send->s_op);
+				if (rm)
+					rds_ib_send_rdma_complete(rm, wc.status);
+			}
+
+			oldest = (oldest + 1) % ic->i_send_ring.w_nr;
+		}
+
+		rds_ib_ring_free(&ic->i_send_ring, completed);
+
+		if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)
+		 || test_bit(0, &conn->c_map_queued))
+			queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+
+		/* We expect errors as the qp is drained during shutdown */
+		if (wc.status != IB_WC_SUCCESS && rds_conn_up(conn)) {
+			rds_ib_conn_error(conn,
+				"send completion on %u.%u.%u.%u "
+				"had status %u, disconnecting and reconnecting\n",
+				NIPQUAD(conn->c_faddr), wc.status);
+		}
+	}
+}
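+
+/* A worked example of the completion accounting above, assuming
+ * rds_ib_ring_completed() counts entries from oldest through wr_id
+ * inclusive with wraparound: for w_nr = 256, oldest = 250 and a
+ * completion with wr_id = 3, completed is 10, so entries 250..255
+ * and 0..3 are all reaped in one pass while the modulo keeps
+ * 'oldest' walking around the ring.
+ */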
+
+/*
+ * This is the main function for allocating credits when sending
+ * messages.
+ *
+ * Conceptually, we have two counters:
+ *  -	send credits: this tells us how many WRs we're allowed
+ *	to submit without overrunning the receiver's queue. For
+ *	each SEND WR we post, we decrement this by one.
+ *
+ *  -	posted credits: this tells us how many WRs we recently
+ *	posted to the receive queue. This value is transferred
+ *	to the peer as a "credit update" in a RDS header field.
+ *	Every time we transmit credits to the peer, we subtract
+ *	the amount of transferred credits from this counter.
+ *
+ * It is essential that we avoid situations where both sides have
+ * exhausted their send credits, and are unable to send new credits
+ * to the peer. We achieve this by requiring that we send at least
+ * one credit update to the peer before exhausting our credits.
+ * When new credits arrive, we subtract one credit that is withheld
+ * until we've posted new buffers and are ready to transmit these
+ * credits (see rds_ib_send_add_credits below).
+ *
+ * The RDS send code is essentially single-threaded; rds_send_xmit
+ * grabs c_send_lock to ensure exclusive access to the send ring.
+ * However, the ACK sending code is independent and can race with
+ * message SENDs.
+ *
+ * In the send path, we need to update the counters for send credits
+ * and the counter of posted buffers atomically - when we use the
+ * last available credit, we cannot allow another thread to race us
+ * and grab the posted credits counter.  Hence, we have to use a
+ * spinlock to protect the credit counter, or use atomics.
+ *
+ * Spinlocks shared between the send and the receive path are bad,
+ * because they create unnecessary delays. An early implementation
+ * using a spinlock showed a 5% degradation in throughput at some
+ * loads.
+ *
+ * This implementation avoids spinlocks completely, putting both
+ * counters into a single atomic, and updating that atomic using
+ * atomic_add (in the receive path, when receiving fresh credits),
+ * and using atomic_cmpxchg when updating the two counters.
+ */
+int rds_ib_send_grab_credits(struct rds_ib_connection *ic,
+			     u32 wanted, u32 *adv_credits)
+{
+	unsigned int avail, posted, got = 0, advertise;
+	long oldval, newval;
+
+	*adv_credits = 0;
+	if (!ic->i_flowctl)
+		return wanted;
+
+try_again:
+	advertise = 0;
+	oldval = newval = atomic_read(&ic->i_credits);
+	posted = IB_GET_POST_CREDITS(oldval);
+	avail = IB_GET_SEND_CREDITS(oldval);
+
+	rdsdebug("rds_ib_send_grab_credits(%u): credits=%u posted=%u\n",
+			wanted, avail, posted);
+
+	/* The last credit must be used to send a credit update. */
+	if (avail && !posted)
+		avail--;
+
+	if (avail < wanted) {
+		struct rds_connection *conn = ic->i_cm_id->context;
+
+		/* Oops, there aren't that many credits left! */
+		set_bit(RDS_LL_SEND_FULL, &conn->c_flags);
+		got = avail;
+	} else {
+		/* Sometimes you get what you want, lalala. */
+		got = wanted;
+	}
+	newval -= IB_SET_SEND_CREDITS(got);
+
+	if (got && posted) {
+		advertise = min_t(unsigned int, posted, RDS_MAX_ADV_CREDIT);
+		newval -= IB_SET_POST_CREDITS(advertise);
+	}
+
+	/* Finally bill everything */
+	if (atomic_cmpxchg(&ic->i_credits, oldval, newval) != oldval)
+		goto try_again;
+
+	*adv_credits = advertise;
+	return got;
+}
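+
+/* The single atomic manipulated above packs both counters into one
+ * 32-bit value; a minimal sketch of the helpers this code assumes
+ * (the actual definitions live in ib.h):
+ *
+ *	#define IB_SET_SEND_CREDITS(v)	((v) & 0xffff)
+ *	#define IB_SET_POST_CREDITS(v)	((v) << 16)
+ *	#define IB_GET_SEND_CREDITS(v)	((v) & 0xffff)
+ *	#define IB_GET_POST_CREDITS(v)	((v) >> 16)
+ *
+ * With send credits in the low 16 bits and posted credits in the
+ * high 16, a single atomic_cmpxchg updates both indivisibly.
+ */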
+
+void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	if (credits == 0)
+		return;
+
+	rdsdebug("rds_ib_send_add_credits(%u): current=%u%s\n",
+			credits,
+			IB_GET_SEND_CREDITS(atomic_read(&ic->i_credits)),
+			test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ? ", ll_send_full" : "");
+
+	atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits);
+	if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags))
+		queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+
+	WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384);
+
+	rds_ib_stats_inc(s_ib_rx_credit_updates);
+}
+
+void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	if (posted == 0)
+		return;
+
+	atomic_add(IB_SET_POST_CREDITS(posted), &ic->i_credits);
+
+	/* Decide whether to send an update to the peer now.
+	 * If we would send a credit update for every single buffer we
+	 * post, we would end up with an ACK storm (ACK arrives,
+	 * consumes buffer, we refill the ring, send ACK to remote
+	 * advertising the newly posted buffer... ad infinitum)
+	 *
+	 * Performance pretty much depends on how often we send
+	 * credit updates - too frequent updates mean lots of ACKs.
+	 * Too infrequent updates, and the peer will run out of
+	 * credits and have to throttle.
+	 * For the time being, 16 seems to be a good compromise.
+	 */
+	if (IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)) >= 16)
+		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+}
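+
+/* Sketching the cadence above: if the peer sends 16 messages and we
+ * repost 16 receive buffers, the posted count crosses the threshold
+ * and IB_ACK_REQUESTED is set, so the next ACK or data frame carries
+ * one credit update covering all 16 buffers rather than 16 separate
+ * storm-inducing updates.
+ */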
+
+static inline void
+rds_ib_xmit_populate_wr(struct rds_ib_connection *ic,
+		struct rds_ib_send_work *send, unsigned int pos,
+		unsigned long buffer, unsigned int length,
+		int send_flags)
+{
+	struct ib_sge *sge;
+
+	WARN_ON(pos != send - ic->i_sends);
+
+	send->s_wr.send_flags = send_flags;
+	send->s_wr.opcode = IB_WR_SEND;
+	send->s_wr.num_sge = 2;
+	send->s_wr.next = NULL;
+	send->s_queued = jiffies;
+	send->s_op = NULL;
+
+	if (length != 0) {
+		sge = rds_ib_data_sge(ic, send->s_sge);
+		sge->addr = buffer;
+		sge->length = length;
+		sge->lkey = ic->i_mr->lkey;
+
+		sge = rds_ib_header_sge(ic, send->s_sge);
+	} else {
+		/* We're sending a packet with no payload. There is only
+		 * one SGE */
+		send->s_wr.num_sge = 1;
+		sge = &send->s_sge[0];
+	}
+
+	sge->addr = ic->i_send_hdrs_dma + (pos * sizeof(struct rds_header));
+	sge->length = sizeof(struct rds_header);
+	sge->lkey = ic->i_mr->lkey;
+}
+
+/*
+ * This can be called multiple times for a given message.  The first time
+ * we see a message we map its scatterlist into the IB device so that
+ * we can provide that mapped address to the IB scatter gather entries
+ * in the IB work requests.  We translate the scatterlist into a series
+ * of work requests that fragment the message.  These work requests complete
+ * in order so we pass ownership of the message to the completion handler
+ * once we send the final fragment.
+ *
+ * The RDS core uses the c_send_lock to only enter this function once
+ * per connection.  This makes sure that the tx ring alloc/unalloc pairs
+ * don't get out of sync and confuse the ring.
+ */
+int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
+		unsigned int hdr_off, unsigned int sg, unsigned int off)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct ib_device *dev = ic->i_cm_id->device;
+	struct rds_ib_send_work *send = NULL;
+	struct rds_ib_send_work *first;
+	struct rds_ib_send_work *prev;
+	struct ib_send_wr *failed_wr;
+	struct scatterlist *scat;
+	u32 pos;
+	u32 i;
+	u32 work_alloc;
+	u32 credit_alloc;
+	u32 adv_credits = 0;
+	int send_flags = 0;
+	int sent;
+	int ret;
+
+	BUG_ON(off % RDS_FRAG_SIZE);
+	BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header));
+
+	/* FIXME we may overallocate here */
+	if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0)
+		i = 1;
+	else
+		i = ceil(be32_to_cpu(rm->m_inc.i_hdr.h_len), RDS_FRAG_SIZE);
+
+	work_alloc = rds_ib_ring_alloc(&ic->i_send_ring, i, &pos);
+	if (work_alloc == 0) {
+		set_bit(RDS_LL_SEND_FULL, &conn->c_flags);
+		rds_ib_stats_inc(s_ib_tx_ring_full);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	credit_alloc = work_alloc;
+	if (ic->i_flowctl) {
+		credit_alloc = rds_ib_send_grab_credits(ic, work_alloc, &adv_credits);
+		if (credit_alloc < work_alloc) {
+			rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - credit_alloc);
+			work_alloc = credit_alloc;
+		}
+		if (work_alloc == 0) {
+			rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
+			rds_ib_stats_inc(s_ib_tx_throttle);
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	/* map the message the first time we see it */
+	if (ic->i_rm == NULL) {
+		/*
+		printk(KERN_NOTICE "rds_ib_xmit prep msg dport=%u flags=0x%x len=%d\n",
+				be16_to_cpu(rm->m_inc.i_hdr.h_dport),
+				rm->m_inc.i_hdr.h_flags,
+				be32_to_cpu(rm->m_inc.i_hdr.h_len));
+		   */
+		if (rm->m_nents) {
+			rm->m_count = ib_dma_map_sg(dev,
+					 rm->m_sg, rm->m_nents, DMA_TO_DEVICE);
+			rdsdebug("ic %p mapping rm %p: %d\n", ic, rm, rm->m_count);
+			if (rm->m_count == 0) {
+				rds_ib_stats_inc(s_ib_tx_sg_mapping_failure);
+				rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
+				ret = -ENOMEM; /* XXX ? */
+				goto out;
+			}
+		} else {
+			rm->m_count = 0;
+		}
+
+		ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs;
+		ic->i_unsignaled_bytes = rds_ib_sysctl_max_unsig_bytes;
+		rds_message_addref(rm);
+		ic->i_rm = rm;
+
+		/* Finalize the header */
+		if (test_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags))
+			rm->m_inc.i_hdr.h_flags |= RDS_FLAG_ACK_REQUIRED;
+		if (test_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags))
+			rm->m_inc.i_hdr.h_flags |= RDS_FLAG_RETRANSMITTED;
+
+		/* If it has a RDMA op, tell the peer we did it. This is
+		 * used by the peer to release use-once RDMA MRs. */
+		if (rm->m_rdma_op) {
+			struct rds_ext_header_rdma ext_hdr;
+
+			ext_hdr.h_rdma_rkey = cpu_to_be32(rm->m_rdma_op->r_key);
+			rds_message_add_extension(&rm->m_inc.i_hdr,
+					RDS_EXTHDR_RDMA, &ext_hdr, sizeof(ext_hdr));
+		}
+		if (rm->m_rdma_cookie) {
+			rds_message_add_rdma_dest_extension(&rm->m_inc.i_hdr,
+					rds_rdma_cookie_key(rm->m_rdma_cookie),
+					rds_rdma_cookie_offset(rm->m_rdma_cookie));
+		}
+
+		/* Note - rds_ib_piggyb_ack clears the ACK_REQUIRED bit, so
+		 * we should not do this unless we have a chance of at least
+		 * sticking the header into the send ring. Which is why we
+		 * should call rds_ib_ring_alloc first. */
+		rm->m_inc.i_hdr.h_ack = cpu_to_be64(rds_ib_piggyb_ack(ic));
+		rds_message_make_checksum(&rm->m_inc.i_hdr);
+	} else if (ic->i_rm != rm)
+		BUG();
+
+	send = &ic->i_sends[pos];
+	first = send;
+	prev = NULL;
+	scat = &rm->m_sg[sg];
+	sent = 0;
+	i = 0;
+
+	/* Sometimes you want to put a fence between an RDMA
+	 * READ and the following SEND.
+	 * We could either do this all the time
+	 * or when requested by the user. Right now, we let
+	 * the application choose.
+	 */
+	if (rm->m_rdma_op && rm->m_rdma_op->r_fence)
+		send_flags = IB_SEND_FENCE;
+
+	/*
+	 * We could be copying the header into the unused tail of the page.
+	 * That would need to be changed in the future when those pages might
+	 * be mapped userspace pages or page cache pages.  So instead we always
+	 * use a second sge and our long-lived ring of mapped headers.  We send
+	 * the header after the data so that the data payload can be aligned on
+	 * the receiver.
+	 */
+
+	/* handle a 0-len message */
+	if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) {
+		rds_ib_xmit_populate_wr(ic, send, pos, 0, 0, send_flags);
+		goto add_header;
+	}
+
+	/* if there's data reference it with a chain of work reqs */
+	for (; i < work_alloc && scat != &rm->m_sg[rm->m_count]; i++) {
+		unsigned int len;
+
+		send = &ic->i_sends[pos];
+
+		len = min(RDS_FRAG_SIZE, ib_sg_dma_len(dev, scat) - off);
+		rds_ib_xmit_populate_wr(ic, send, pos,
+				ib_sg_dma_address(dev, scat) + off, len,
+				send_flags);
+
+		/*
+		 * We want to delay signaling completions just enough to get
+		 * the batching benefits but not so much that we create dead time
+		 * on the wire.
+		 */
+		if (ic->i_unsignaled_wrs-- == 0) {
+			ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs;
+			send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+		}
+
+		ic->i_unsignaled_bytes -= len;
+		if (ic->i_unsignaled_bytes <= 0) {
+			ic->i_unsignaled_bytes = rds_ib_sysctl_max_unsig_bytes;
+			send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+		}
+
+		rdsdebug("send %p wr %p num_sge %u next %p\n", send,
+			 &send->s_wr, send->s_wr.num_sge, send->s_wr.next);
+
+		sent += len;
+		off += len;
+		if (off == ib_sg_dma_len(dev, scat)) {
+			scat++;
+			off = 0;
+		}
+
+add_header:
+		/* Tack on the header after the data. The header SGE should already
+		 * have been set up to point to the right header buffer. */
+		memcpy(&ic->i_send_hdrs[pos], &rm->m_inc.i_hdr, sizeof(struct rds_header));
+
+		if (0) {
+			struct rds_header *hdr = &ic->i_send_hdrs[pos];
+
+			printk(KERN_NOTICE "send WR dport=%u flags=0x%x len=%d\n",
+				be16_to_cpu(hdr->h_dport),
+				hdr->h_flags,
+				be32_to_cpu(hdr->h_len));
+		}
+		if (adv_credits) {
+			struct rds_header *hdr = &ic->i_send_hdrs[pos];
+
+			/* add credit and redo the header checksum */
+			hdr->h_credit = adv_credits;
+			rds_message_make_checksum(hdr);
+			adv_credits = 0;
+			rds_ib_stats_inc(s_ib_tx_credit_updates);
+		}
+
+		if (prev)
+			prev->s_wr.next = &send->s_wr;
+		prev = send;
+
+		pos = (pos + 1) % ic->i_send_ring.w_nr;
+	}
+
+	/* Account the RDS header in the number of bytes we sent, but just once.
+	 * The caller has no concept of fragmentation. */
+	if (hdr_off == 0)
+		sent += sizeof(struct rds_header);
+
+	/* if we finished the message then send completion owns it */
+	if (scat == &rm->m_sg[rm->m_count]) {
+		prev->s_rm = ic->i_rm;
+		prev->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+		ic->i_rm = NULL;
+	}
+
+	if (i < work_alloc) {
+		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i);
+		work_alloc = i;
+	}
+	if (ic->i_flowctl && i < credit_alloc)
+		rds_ib_send_add_credits(conn, credit_alloc - i);
+
+	/* XXX need to worry about failed_wr and partial sends. */
+	failed_wr = &first->s_wr;
+	ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr);
+	rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic,
+		 first, &first->s_wr, ret, failed_wr);
+	BUG_ON(failed_wr != &first->s_wr);
+	if (ret) {
+		printk(KERN_WARNING "RDS/IB: ib_post_send to %u.%u.%u.%u "
+		       "returned %d\n", NIPQUAD(conn->c_faddr), ret);
+		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
+		if (prev->s_rm) {
+			ic->i_rm = prev->s_rm;
+			prev->s_rm = NULL;
+		}
+		/* Finesse this later */
+		BUG();
+		goto out;
+	}
+
+	ret = sent;
+out:
+	BUG_ON(adv_credits);
+	return ret;
+}
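+
+/* A caller's-eye sketch of the contract above (illustrative numbers):
+ * rds_send_xmit() invokes rds_ib_xmit() repeatedly for one message,
+ * advancing hdr_off/sg/off by the return value each time.  A 24KB
+ * message at the assumed 4KB RDS_FRAG_SIZE thus spans 6 fragments,
+ * and only the first call's return value counts the rds_header.
+ * -ENOMEM means the ring or credit pool ran dry; RDS_LL_SEND_FULL
+ * defers the retry until send completions free up space.
+ */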
+
+int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct rds_ib_send_work *send = NULL;
+	struct rds_ib_send_work *first;
+	struct rds_ib_send_work *prev;
+	struct ib_send_wr *failed_wr;
+	struct rds_ib_device *rds_ibdev;
+	struct scatterlist *scat;
+	unsigned long len;
+	u64 remote_addr = op->r_remote_addr;
+	u32 pos;
+	u32 work_alloc;
+	u32 i;
+	u32 j;
+	int sent;
+	int ret;
+	int num_sge;
+
+	rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client);
+
+	/* map the message the first time we see it */
+	if (!op->r_mapped) {
+		op->r_count = ib_dma_map_sg(ic->i_cm_id->device,
+					op->r_sg, op->r_nents, (op->r_write) ?
+					DMA_TO_DEVICE : DMA_FROM_DEVICE);
+		rdsdebug("ic %p mapping op %p: %d\n", ic, op, op->r_count);
+		if (op->r_count == 0) {
+			rds_ib_stats_inc(s_ib_tx_sg_mapping_failure);
+			ret = -ENOMEM; /* XXX ? */
+			goto out;
+		}
+
+		op->r_mapped = 1;
+	}
+
+	/*
+	 * Instead of knowing how to return a partial rdma read/write, we insist that there
+	 * be enough work requests to send the entire message.
+	 */
+	i = ceil(op->r_count, rds_ibdev->max_sge);
+
+	work_alloc = rds_ib_ring_alloc(&ic->i_send_ring, i, &pos);
+	if (work_alloc != i) {
+		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
+		rds_ib_stats_inc(s_ib_tx_ring_full);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	send = &ic->i_sends[pos];
+	first = send;
+	prev = NULL;
+	scat = &op->r_sg[0];
+	sent = 0;
+	num_sge = op->r_count;
+
+	for (i = 0; i < work_alloc && scat != &op->r_sg[op->r_count]; i++) {
+		send->s_wr.send_flags = 0;
+		send->s_queued = jiffies;
+		/*
+		 * We want to delay signaling completions just enough to get
+		 * the batching benefits but not so much that we create dead time on the wire.
+		 */
+		if (ic->i_unsignaled_wrs-- == 0) {
+			ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs;
+			send->s_wr.send_flags = IB_SEND_SIGNALED;
+		}
+
+		send->s_wr.opcode = op->r_write ? IB_WR_RDMA_WRITE : IB_WR_RDMA_READ;
+		send->s_wr.wr.rdma.remote_addr = remote_addr;
+		send->s_wr.wr.rdma.rkey = op->r_key;
+		send->s_op = op;
+
+		if (num_sge > rds_ibdev->max_sge) {
+			send->s_wr.num_sge = rds_ibdev->max_sge;
+			num_sge -= rds_ibdev->max_sge;
+		} else {
+			send->s_wr.num_sge = num_sge;
+		}
+
+		send->s_wr.next = NULL;
+
+		if (prev)
+			prev->s_wr.next = &send->s_wr;
+
+		for (j = 0; j < send->s_wr.num_sge && scat != &op->r_sg[op->r_count]; j++) {
+			len = ib_sg_dma_len(ic->i_cm_id->device, scat);
+			send->s_sge[j].addr =
+				 ib_sg_dma_address(ic->i_cm_id->device, scat);
+			send->s_sge[j].length = len;
+			send->s_sge[j].lkey = ic->i_mr->lkey;
+
+			sent += len;
+			rdsdebug("ic %p sent %d remote_addr %llu\n", ic, sent, remote_addr);
+
+			remote_addr += len;
+			scat++;
+		}
+
+		rdsdebug("send %p wr %p num_sge %u next %p\n", send,
+			&send->s_wr, send->s_wr.num_sge, send->s_wr.next);
+
+		prev = send;
+		if (++send == &ic->i_sends[ic->i_send_ring.w_nr])
+			send = ic->i_sends;
+	}
+
+	/* if we finished the message then send completion owns it */
+	if (scat == &op->r_sg[op->r_count])
+		prev->s_wr.send_flags = IB_SEND_SIGNALED;
+
+	if (i < work_alloc) {
+		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i);
+		work_alloc = i;
+	}
+
+	failed_wr = &first->s_wr;
+	ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr);
+	rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic,
+		 first, &first->s_wr, ret, failed_wr);
+	BUG_ON(failed_wr != &first->s_wr);
+	if (ret) {
+		printk(KERN_WARNING "RDS/IB: rdma ib_post_send to %u.%u.%u.%u "
+		       "returned %d\n", NIPQUAD(conn->c_faddr), ret);
+		rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc);
+		goto out;
+	}
+
+	if (unlikely(failed_wr != &first->s_wr)) {
+		printk(KERN_WARNING "RDS/IB: ib_post_send() rc=%d, but failed_wr updated!\n", ret);
+		BUG_ON(failed_wr != &first->s_wr);
+	}
+
+out:
+	return ret;
+}
+
+void rds_ib_xmit_complete(struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+
+	/* We may have a pending ACK or window update we were unable
+	 * to send previously (due to flow control). Try again. */
+	rds_ib_attempt_ack(ic);
+}
-- 
1.5.6.3



* [PATCH 17/21] RDS/IB: Receive datagrams via IB
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (15 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 16/21] RDS/IB: Implement IB-specific datagram send Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-29  0:05   ` [ofa-general] " Roland Dreier
  2009-01-27  2:17 ` [PATCH 18/21] RDS/IB: Stats and sysctls Andy Grover
                   ` (5 subsequent siblings)
  22 siblings, 1 reply; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

Header parsing and ring refill. Incoming data is placed into an
rds_incoming struct, which is passed up to rds-core.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib_recv.c |  894 ++++++++++++++++++++++++++++++++++
 1 files changed, 894 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib_recv.c

diff --git a/drivers/infiniband/ulp/rds/ib_recv.c b/drivers/infiniband/ulp/rds/ib_recv.c
new file mode 100644
index 0000000..516f858
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_recv.c
@@ -0,0 +1,894 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/dma-mapping.h>
+#include <rdma/rdma_cm.h>
+
+#include "rds.h"
+#include "ib.h"
+
+static struct kmem_cache *rds_ib_incoming_slab;
+static struct kmem_cache *rds_ib_frag_slab;
+static atomic_t	rds_ib_allocation = ATOMIC_INIT(0);
+
+static void rds_ib_frag_drop_page(struct rds_page_frag *frag)
+{
+	rdsdebug("frag %p page %p\n", frag, frag->f_page);
+	__free_page(frag->f_page);
+	frag->f_page = NULL;
+}
+
+static void rds_ib_frag_free(struct rds_page_frag *frag)
+{
+	rdsdebug("frag %p page %p\n", frag, frag->f_page);
+	BUG_ON(frag->f_page != NULL);
+	kmem_cache_free(rds_ib_frag_slab, frag);
+}
+
+/*
+ * We map a page at a time.  Its fragments are posted in order.  This
+ * is called in fragment order as the fragments get receive completion events.
+ * Only the last frag in the page performs the unmapping.
+ *
+ * It's OK for ring cleanup to call this in whatever order it likes because
+ * DMA is not in flight and so we can unmap while other ring entries still
+ * hold page references in their frags.
+ */
+static void rds_ib_recv_unmap_page(struct rds_ib_connection *ic,
+				   struct rds_ib_recv_work *recv)
+{
+	struct rds_page_frag *frag = recv->r_frag;
+
+	rdsdebug("recv %p frag %p page %p\n", recv, frag, frag->f_page);
+	if (frag->f_mapped)
+		ib_dma_unmap_page(ic->i_cm_id->device,
+			       frag->f_mapped,
+			       RDS_FRAG_SIZE, DMA_FROM_DEVICE);
+	frag->f_mapped = 0;
+}
+
+void rds_ib_recv_init_ring(struct rds_ib_connection *ic)
+{
+	struct rds_ib_recv_work *recv;
+	u32 i;
+
+	for (i = 0, recv = ic->i_recvs; i < ic->i_recv_ring.w_nr; i++, recv++) {
+		struct ib_sge *sge;
+
+		recv->r_ibinc = NULL;
+		recv->r_frag = NULL;
+
+		recv->r_wr.next = NULL;
+		recv->r_wr.wr_id = i;
+		recv->r_wr.sg_list = recv->r_sge;
+		recv->r_wr.num_sge = RDS_IB_RECV_SGE;
+
+		sge = rds_ib_data_sge(ic, recv->r_sge);
+		sge->addr = 0;
+		sge->length = RDS_FRAG_SIZE;
+		sge->lkey = ic->i_mr->lkey;
+
+		sge = rds_ib_header_sge(ic, recv->r_sge);
+		sge->addr = ic->i_recv_hdrs_dma + (i * sizeof(struct rds_header));
+		sge->length = sizeof(struct rds_header);
+		sge->lkey = ic->i_mr->lkey;
+	}
+}
+
+static void rds_ib_recv_clear_one(struct rds_ib_connection *ic,
+				  struct rds_ib_recv_work *recv)
+{
+	if (recv->r_ibinc) {
+		rds_inc_put(&recv->r_ibinc->ii_inc);
+		recv->r_ibinc = NULL;
+	}
+	if (recv->r_frag) {
+		rds_ib_recv_unmap_page(ic, recv);
+		if (recv->r_frag->f_page)
+			rds_ib_frag_drop_page(recv->r_frag);
+		rds_ib_frag_free(recv->r_frag);
+		recv->r_frag = NULL;
+	}
+}
+
+void rds_ib_recv_clear_ring(struct rds_ib_connection *ic)
+{
+	u32 i;
+
+	for (i = 0; i < ic->i_recv_ring.w_nr; i++)
+		rds_ib_recv_clear_one(ic, &ic->i_recvs[i]);
+
+	if (ic->i_frag.f_page)
+		rds_ib_frag_drop_page(&ic->i_frag);
+}
+
+static int rds_ib_recv_refill_one(struct rds_connection *conn,
+				  struct rds_ib_recv_work *recv,
+				  gfp_t kptr_gfp, gfp_t page_gfp)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	dma_addr_t dma_addr;
+	struct ib_sge *sge;
+	int ret = -ENOMEM;
+
+	if (recv->r_ibinc == NULL) {
+		if (atomic_read(&rds_ib_allocation) >= rds_ib_sysctl_max_recv_allocation) {
+			rds_ib_stats_inc(s_ib_rx_alloc_limit);
+			goto out;
+		}
+		recv->r_ibinc = kmem_cache_alloc(rds_ib_incoming_slab,
+						 kptr_gfp);
+		if (recv->r_ibinc == NULL)
+			goto out;
+		atomic_inc(&rds_ib_allocation);
+		INIT_LIST_HEAD(&recv->r_ibinc->ii_frags);
+		rds_inc_init(&recv->r_ibinc->ii_inc, conn, conn->c_faddr);
+	}
+
+	if (recv->r_frag == NULL) {
+		recv->r_frag = kmem_cache_alloc(rds_ib_frag_slab, kptr_gfp);
+		if (recv->r_frag == NULL)
+			goto out;
+		INIT_LIST_HEAD(&recv->r_frag->f_item);
+		recv->r_frag->f_page = NULL;
+	}
+
+	if (ic->i_frag.f_page == NULL) {
+		ic->i_frag.f_page = alloc_page(page_gfp);
+		if (ic->i_frag.f_page == NULL)
+			goto out;
+		ic->i_frag.f_offset = 0;
+	}
+
+	dma_addr = ib_dma_map_page(ic->i_cm_id->device,
+				  ic->i_frag.f_page,
+				  ic->i_frag.f_offset,
+				  RDS_FRAG_SIZE,
+				  DMA_FROM_DEVICE);
+	if (ib_dma_mapping_error(ic->i_cm_id->device, dma_addr))
+		goto out;
+
+	/*
+	 * Once we get the RDS_PAGE_LAST_OFF frag then rds_ib_frag_unmap()
+	 * must be called on this recv.  This happens as completions hit
+	 * in order or on connection shutdown.
+	 */
+	recv->r_frag->f_page = ic->i_frag.f_page;
+	recv->r_frag->f_offset = ic->i_frag.f_offset;
+	recv->r_frag->f_mapped = dma_addr;
+
+	sge = rds_ib_data_sge(ic, recv->r_sge);
+	sge->addr = dma_addr;
+	sge->length = RDS_FRAG_SIZE;
+
+	sge = rds_ib_header_sge(ic, recv->r_sge);
+	sge->addr = ic->i_recv_hdrs_dma + (recv - ic->i_recvs) * sizeof(struct rds_header);
+	sge->length = sizeof(struct rds_header);
+
+	get_page(recv->r_frag->f_page);
+
+	if (ic->i_frag.f_offset < RDS_PAGE_LAST_OFF) {
+		ic->i_frag.f_offset += RDS_FRAG_SIZE;
+	} else {
+		put_page(ic->i_frag.f_page);
+		ic->i_frag.f_page = NULL;
+		ic->i_frag.f_offset = 0;
+	}
+
+	ret = 0;
+out:
+	return ret;
+}
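+
+/* To illustrate the page carving above: i_frag remembers a partially
+ * consumed page and hands out RDS_FRAG_SIZE slices at increasing
+ * f_offset until RDS_PAGE_LAST_OFF, at which point the connection's
+ * page reference is dropped and the next refill allocates a fresh
+ * page.  Where PAGE_SIZE equals the assumed 4KB RDS_FRAG_SIZE this
+ * degenerates to one fragment per page; with 64KB pages (e.g. some
+ * powerpc configs) each page backs 16 fragments, amortizing the
+ * allocator cost.
+ */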
+
+/*
+ * This tries to allocate and post unused work requests after making sure that
+ * they have all the allocations they need to queue received fragments into
+ * sockets.  The i_recv_mutex is held here so that ring_alloc and _unalloc
+ * pairs don't go unmatched.
+ *
+ * -1 is returned if posting fails due to temporary resource exhaustion.
+ */
+int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp,
+		       gfp_t page_gfp, int prefill)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct rds_ib_recv_work *recv;
+	struct ib_recv_wr *failed_wr;
+	unsigned int posted = 0;
+	int ret = 0;
+	u32 pos;
+
+	while ((prefill || rds_conn_up(conn))
+			&& rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) {
+		if (pos >= ic->i_recv_ring.w_nr) {
+			printk(KERN_NOTICE "Argh - ring alloc returned pos=%u\n",
+					pos);
+			ret = -EINVAL;
+			break;
+		}
+
+		recv = &ic->i_recvs[pos];
+		ret = rds_ib_recv_refill_one(conn, recv, kptr_gfp, page_gfp);
+		if (ret) {
+			ret = -1;
+			break;
+		}
+
+		/* XXX when can this fail? */
+		ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
+		rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
+			 recv->r_ibinc, recv->r_frag->f_page,
+			 (long) recv->r_frag->f_mapped, ret);
+		if (ret) {
+			rds_ib_conn_error(conn, "recv post on "
+			       "%u.%u.%u.%u returned %d, disconnecting and "
+			       "reconnecting\n", NIPQUAD(conn->c_faddr),
+			       ret);
+			ret = -1;
+			break;
+		}
+
+		posted++;
+	}
+
+	/* We're doing flow control - update the window. */
+	if (ic->i_flowctl && posted)
+		rds_ib_advertise_credits(conn, posted);
+
+	if (ret)
+		rds_ib_ring_unalloc(&ic->i_recv_ring, 1);
+	return ret;
+}
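+
+/* The two gfp arguments let callers pick per-context allocation
+ * behaviour: kptr_gfp feeds the kmem_cache allocations and page_gfp
+ * feeds alloc_page().  Hypothetical call sites, not shown in this
+ * patch: a connection-setup prefill might pass GFP_KERNEL,
+ * GFP_HIGHUSER and prefill = 1, while a completion-handler top-up
+ * would pass GFP_NOWAIT for both so the softirq path never sleeps.
+ */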
+
+void rds_ib_inc_purge(struct rds_incoming *inc)
+{
+	struct rds_ib_incoming *ibinc;
+	struct rds_page_frag *frag;
+	struct rds_page_frag *pos;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+	rdsdebug("purging ibinc %p inc %p\n", ibinc, inc);
+
+	list_for_each_entry_safe(frag, pos, &ibinc->ii_frags, f_item) {
+		list_del_init(&frag->f_item);
+		rds_ib_frag_drop_page(frag);
+		rds_ib_frag_free(frag);
+	}
+}
+
+void rds_ib_inc_free(struct rds_incoming *inc)
+{
+	struct rds_ib_incoming *ibinc;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+
+	rds_ib_inc_purge(inc);
+	rdsdebug("freeing ibinc %p inc %p\n", ibinc, inc);
+	BUG_ON(!list_empty(&ibinc->ii_frags));
+	kmem_cache_free(rds_ib_incoming_slab, ibinc);
+	atomic_dec(&rds_ib_allocation);
+	BUG_ON(atomic_read(&rds_ib_allocation) < 0);
+}
+
+int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *first_iov,
+			    size_t size)
+{
+	struct rds_ib_incoming *ibinc;
+	struct rds_page_frag *frag;
+	struct iovec *iov = first_iov;
+	unsigned long to_copy;
+	unsigned long frag_off = 0;
+	unsigned long iov_off = 0;
+	int copied = 0;
+	int ret;
+	u32 len;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+	frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+	len = be32_to_cpu(inc->i_hdr.h_len);
+
+	while (copied < size && copied < len) {
+		if (frag_off == RDS_FRAG_SIZE) {
+			frag = list_entry(frag->f_item.next,
+					  struct rds_page_frag, f_item);
+			frag_off = 0;
+		}
+		while (iov_off == iov->iov_len) {
+			iov_off = 0;
+			iov++;
+		}
+
+		to_copy = min(iov->iov_len - iov_off, RDS_FRAG_SIZE - frag_off);
+		to_copy = min_t(size_t, to_copy, size - copied);
+		to_copy = min_t(unsigned long, to_copy, len - copied);
+
+		rdsdebug("%lu bytes to user [%p, %zu] + %lu from frag "
+			 "[%p, %lu] + %lu\n",
+			 to_copy, iov->iov_base, iov->iov_len, iov_off,
+			 frag->f_page, frag->f_offset, frag_off);
+
+		/* XXX needs + offset for multiple recvs per page */
+		ret = rds_page_copy_to_user(frag->f_page,
+					    frag->f_offset + frag_off,
+					    iov->iov_base + iov_off,
+					    to_copy);
+		if (ret) {
+			copied = ret;
+			break;
+		}
+
+		iov_off += to_copy;
+		frag_off += to_copy;
+		copied += to_copy;
+	}
+
+	return copied;
+}
+
+/* ic starts out kzalloc()ed */
+void rds_ib_recv_init_ack(struct rds_ib_connection *ic)
+{
+	struct ib_send_wr *wr = &ic->i_ack_wr;
+	struct ib_sge *sge = &ic->i_ack_sge;
+
+	sge->addr = ic->i_ack_dma;
+	sge->length = sizeof(struct rds_header);
+	sge->lkey = ic->i_mr->lkey;
+
+	wr->sg_list = sge;
+	wr->num_sge = 1;
+	wr->opcode = IB_WR_SEND;
+	wr->wr_id = RDS_IB_ACK_WR_ID;
+	wr->send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;
+}
+
+/*
+ * You'd think that with reliable IB connections you wouldn't need to ack
+ * messages that have been received.  The problem is that IB hardware generates
+ * an ack message before it has DMAed the message into memory.  This creates a
+ * potential message loss if the HCA is disabled for any reason between when it
+ * sends the ack and before the message is DMAed and processed.  This is only a
+ * potential issue if another HCA is available for fail-over.
+ *
+ * When the remote host receives our ack they'll free the sent message from
+ * their send queue.  To decrease the latency of this we always send an ack
+ * immediately after we've received messages.
+ *
+ * For simplicity, we only have one ack in flight at a time.  This puts
+ * pressure on senders to have deep enough send queues to absorb the latency of
+ * a single ack frame being in flight.  This might not be good enough.
+ *
+ * This is implemented by having a long-lived send_wr and sge which point to a
+ * statically allocated ack frame.  This ack wr does not fall under the ring
+ * accounting that the tx and rx wrs do.  The QP attribute specifically makes
+ * room for it beyond the ring size.  Send completion notices its special
+ * wr_id and avoids working with the ring in that case.
+ */
+#ifndef KERNEL_HAS_ATOMIC64
+static void rds_ib_set_ack(struct rds_ib_connection *ic, u64 seq,
+				int ack_required)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ic->i_ack_lock, flags);
+	ic->i_ack_next = seq;
+	if (ack_required)
+		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+	spin_unlock_irqrestore(&ic->i_ack_lock, flags);
+}
+
+static u64 rds_ib_get_ack(struct rds_ib_connection *ic)
+{
+	unsigned long flags;
+	u64 seq;
+
+	clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+
+	spin_lock_irqsave(&ic->i_ack_lock, flags);
+	seq = ic->i_ack_next;
+	spin_unlock_irqrestore(&ic->i_ack_lock, flags);
+
+	return seq;
+}
+#else
+static void rds_ib_set_ack(struct rds_ib_connection *ic, u64 seq,
+				int ack_required)
+{
+	atomic64_set(&ic->i_ack_next, seq);
+	if (ack_required) {
+		smp_mb__before_clear_bit();
+		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+	}
+}
+
+static u64 rds_ib_get_ack(struct rds_ib_connection *ic)
+{
+	clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+	smp_mb__after_clear_bit();
+
+	return atomic64_read(&ic->i_ack_next);
+}
+#endif
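+
+/* In the atomic64 variant above the barriers pair with each other:
+ * rds_ib_set_ack() publishes i_ack_next before setting
+ * IB_ACK_REQUESTED, and rds_ib_get_ack() clears the bit before
+ * reading i_ack_next.  A sequence number may therefore be acked
+ * twice, but a requested ack is never lost.  The spinlock variant
+ * relies on i_ack_lock to order the sequence number accesses.
+ */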
+
+static void rds_ib_send_ack(struct rds_ib_connection *ic, unsigned int adv_credits)
+{
+	struct rds_header *hdr = ic->i_ack;
+	struct ib_send_wr *failed_wr;
+	u64 seq;
+	int ret;
+
+	seq = rds_ib_get_ack(ic);
+
+	rdsdebug("send_ack: ic %p ack %llu\n", ic, (unsigned long long) seq);
+	rds_message_populate_header(hdr, 0, 0, 0);
+	hdr->h_ack = cpu_to_be64(seq);
+	hdr->h_credit = adv_credits;
+	rds_message_make_checksum(hdr);
+	ic->i_ack_queued = jiffies;
+
+	ret = ib_post_send(ic->i_cm_id->qp, &ic->i_ack_wr, &failed_wr);
+	if (unlikely(ret)) {
+		/* Failed to send. Release the WR, and
+		 * force another ACK.
+		 */
+		clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+		set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+
+		rds_ib_stats_inc(s_ib_ack_send_failure);
+		/* Need to finesse this later. */
+		BUG();
+	} else
+		rds_ib_stats_inc(s_ib_ack_sent);
+}
+
+/*
+ * There are 3 ways of getting acknowledgements to the peer:
+ *  1.	We call rds_ib_attempt_ack from the recv completion handler
+ *	to send an ACK-only frame.
+ *	However, there can be only one such frame in the send queue
+ *	at any time, so we may have to postpone it.
+ *  2.	When another (data) packet is transmitted while there's
+ *	an ACK in the queue, we piggyback the ACK sequence number
+ *	on the data packet.
+ *  3.	If the ACK WR is done sending, we get called from the
+ *	send queue completion handler, and check whether there's
+ *	another ACK pending (postponed because the WR was on the
+ *	queue). If so, we transmit it.
+ *
+ * We maintain 2 variables:
+ *  -	i_ack_flags, which keeps track of whether the ACK WR
+ *	is currently in the send queue or not (IB_ACK_IN_FLIGHT)
+ *  -	i_ack_next, which is the last sequence number we received
+ *
+ * Potentially, send queue and receive queue handlers can run concurrently.
+ * It would be nice to not have to use a spinlock to synchronize things,
+ * but the one problem that rules this out is that 64bit updates are
+ * not atomic on all platforms. Things would be a lot simpler if
+ * we had atomic64 or maybe cmpxchg64 everywhere.
+ *
+ * Reconnecting complicates this picture just slightly. When we
+ * reconnect, we may be seeing duplicate packets. The peer
+ * is retransmitting them, because it hasn't seen an ACK for
+ * them. It is important that we ACK these.
+ *
+ * ACK mitigation adds a header flag "ACK_REQUIRED"; any packet with
+ * this flag set *MUST* be acknowledged immediately.
+ */
+
+/*
+ * When we get here, we're called from the recv queue handler.
+ * Check whether we ought to transmit an ACK.
+ */
+void rds_ib_attempt_ack(struct rds_ib_connection *ic)
+{
+	unsigned int adv_credits;
+
+	if (!test_bit(IB_ACK_REQUESTED, &ic->i_ack_flags))
+		return;
+
+	if (test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) {
+		rds_ib_stats_inc(s_ib_ack_send_delayed);
+		return;
+	}
+
+	/* Can we get a send credit? */
+	if (!rds_ib_send_grab_credits(ic, 1, &adv_credits)) {
+		rds_ib_stats_inc(s_ib_tx_throttle);
+		clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+		return;
+	}
+
+	clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags);
+	rds_ib_send_ack(ic, adv_credits);
+}
+
+/*
+ * We get here from the send completion handler, when the
+ * adapter tells us the ACK frame was sent.
+ */
+void rds_ib_ack_send_complete(struct rds_ib_connection *ic)
+{
+	clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags);
+	rds_ib_attempt_ack(ic);
+}
+
+/*
+ * This is called by the regular xmit code when it wants to piggyback
+ * an ACK on an outgoing frame.
+ */
+u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic)
+{
+	if (test_and_clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags))
+		rds_ib_stats_inc(s_ib_ack_send_piggybacked);
+	return rds_ib_get_ack(ic);
+}
+
+/*
+ * It's kind of lame that we're copying from the posted receive pages into
+ * long-lived bitmaps.  We could have posted the bitmaps and rdma written into
+ * them.  But receiving new congestion bitmaps should be a *rare* event, so
+ * hopefully we won't need to invest that complexity in making it more
+ * efficient.  By copying we can share a simpler core with TCP which has to
+ * copy.
+ */
+static void rds_ib_cong_recv(struct rds_connection *conn,
+			      struct rds_ib_incoming *ibinc)
+{
+	struct rds_cong_map *map;
+	unsigned int map_off;
+	unsigned int map_page;
+	struct rds_page_frag *frag;
+	unsigned long frag_off;
+	unsigned long to_copy;
+	unsigned long copied;
+	uint64_t uncongested = 0;
+	void *addr;
+
+	/* catch completely corrupt packets */
+	if (be32_to_cpu(ibinc->ii_inc.i_hdr.h_len) != RDS_CONG_MAP_BYTES)
+		return;
+
+	map = conn->c_fcong;
+	map_page = 0;
+	map_off = 0;
+
+	frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+	frag_off = 0;
+
+	copied = 0;
+
+	while (copied < RDS_CONG_MAP_BYTES) {
+		uint64_t *src, *dst;
+		unsigned int k;
+
+		to_copy = min(RDS_FRAG_SIZE - frag_off, PAGE_SIZE - map_off);
+		BUG_ON(to_copy & 7); /* Must be 64bit aligned. */
+
+		addr = kmap_atomic(frag->f_page, KM_SOFTIRQ0);
+
+		src = addr + frag_off;
+		dst = (void *)map->m_page_addrs[map_page] + map_off;
+		for (k = 0; k < to_copy; k += 8) {
+			/* Record ports that became uncongested, ie
+			 * bits that changed from 0 to 1. */
+			uncongested |= ~(*src) & *dst;
+			*dst++ = *src++;
+		}
+		kunmap_atomic(addr, KM_SOFTIRQ0);
+
+		copied += to_copy;
+
+		map_off += to_copy;
+		if (map_off == PAGE_SIZE) {
+			map_off = 0;
+			map_page++;
+		}
+
+		frag_off += to_copy;
+		if (frag_off == RDS_FRAG_SIZE) {
+			frag = list_entry(frag->f_item.next,
+					  struct rds_page_frag, f_item);
+			frag_off = 0;
+		}
+	}
+
+	/* the congestion map is in little endian order */
+	uncongested = le64_to_cpu(uncongested);
+
+	rds_cong_map_updated(map, uncongested);
+}
+
+/*
+ * Rings are posted with all the allocations they'll need to queue the
+ * incoming message to the receiving socket so this can't fail.
+ * All fragments start with a header, so we can make sure we're not receiving
+ * garbage, and we can tell a small 8 byte fragment from an ACK frame.
+ */
+struct rds_ib_ack_state {
+	u64		ack_next;
+	u64		ack_recv;
+	unsigned int	ack_required:1;
+	unsigned int	ack_next_valid:1;
+	unsigned int	ack_recv_valid:1;
+};
+
+static void rds_ib_process_recv(struct rds_connection *conn,
+				struct rds_ib_recv_work *recv, u32 byte_len,
+				struct rds_ib_ack_state *state)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct rds_ib_incoming *ibinc = ic->i_ibinc;
+	struct rds_header *ihdr, *hdr;
+
+	/* XXX shut down the connection if port 0,0 are seen? */
+
+	rdsdebug("ic %p ibinc %p recv %p byte len %u\n", ic, ibinc, recv,
+		 byte_len);
+
+	if (byte_len < sizeof(struct rds_header)) {
+		rds_ib_conn_error(conn, "incoming message "
+		       "from %u.%u.%u.%u didn't inclue a "
+		       "header, disconnecting and "
+		       "reconnecting\n",
+		       NIPQUAD(conn->c_faddr));
+		return;
+	}
+	byte_len -= sizeof(struct rds_header);
+
+	ihdr = &ic->i_recv_hdrs[recv - ic->i_recvs];
+
+	/* Validate the checksum. */
+	if (!rds_message_verify_checksum(ihdr)) {
+		rds_ib_conn_error(conn, "incoming message "
+		       "from %u.%u.%u.%u has corrupted header - "
+		       "forcing a reconnect\n",
+		       NIPQUAD(conn->c_faddr));
+		rds_stats_inc(s_recv_drop_bad_checksum);
+		return;
+	}
+
+	/* Process the ACK sequence which comes with every packet */
+	state->ack_recv = be64_to_cpu(ihdr->h_ack);
+	state->ack_recv_valid = 1;
+
+	/* Process the credits update if there was one */
+	if (ihdr->h_credit)
+		rds_ib_send_add_credits(conn, ihdr->h_credit);
+
+	if (ihdr->h_sport == 0 && ihdr->h_dport == 0 && byte_len == 0) {
+		/* This is an ACK-only packet. The fact that it gets
+		 * special treatment here is that historically, ACKs
+		 * were rather special beasts.
+		 */
+		rds_ib_stats_inc(s_ib_ack_received);
+
+		/*
+		 * Usually the frags make their way on to incs and are then freed as
+		 * the inc is freed.  We don't go that route, so we have to drop the
+		 * page ref ourselves.  We can't just leave the page on the recv
+		 * because that confuses the dma mapping of pages and each recv's use
+		 * of a partial page.  We can leave the frag, though, it will be
+		 * reused.
+		 *
+		 * FIXME: Fold this into the code path below.
+		 */
+		rds_ib_frag_drop_page(recv->r_frag);
+		return;
+	}
+
+	/*
+	 * If we don't already have an inc on the connection then this
+	 * fragment has a header and starts a message; copy its header
+	 * into the inc and save the inc so we can hang upcoming fragments
+	 * off its list.
+	 */
+	if (ibinc == NULL) {
+		ibinc = recv->r_ibinc;
+		recv->r_ibinc = NULL;
+		ic->i_ibinc = ibinc;
+
+		hdr = &ibinc->ii_inc.i_hdr;
+		memcpy(hdr, ihdr, sizeof(*hdr));
+		ic->i_recv_data_rem = be32_to_cpu(hdr->h_len);
+
+		rdsdebug("ic %p ibinc %p rem %u flag 0x%x\n", ic, ibinc,
+			 ic->i_recv_data_rem, hdr->h_flags);
+	} else {
+		hdr = &ibinc->ii_inc.i_hdr;
+		/* We can't just use memcmp here; fragments of a
+		 * single message may carry different ACKs */
+		if (hdr->h_sequence != ihdr->h_sequence
+		 || hdr->h_len != ihdr->h_len
+		 || hdr->h_sport != ihdr->h_sport
+		 || hdr->h_dport != ihdr->h_dport) {
+			rds_ib_conn_error(conn,
+				"fragment header mismatch; forcing reconnect\n");
+			return;
+		}
+	}
+
+	list_add_tail(&recv->r_frag->f_item, &ibinc->ii_frags);
+	recv->r_frag = NULL;
+
+	if (ic->i_recv_data_rem > RDS_FRAG_SIZE)
+		ic->i_recv_data_rem -= RDS_FRAG_SIZE;
+	else {
+		ic->i_recv_data_rem = 0;
+		ic->i_ibinc = NULL;
+
+		if (ibinc->ii_inc.i_hdr.h_flags == RDS_FLAG_CONG_BITMAP)
+			rds_ib_cong_recv(conn, ibinc);
+		else {
+			rds_recv_incoming(conn, conn->c_faddr, conn->c_laddr,
+					  &ibinc->ii_inc, GFP_ATOMIC,
+					  KM_SOFTIRQ0);
+			state->ack_next = be64_to_cpu(hdr->h_sequence);
+			state->ack_next_valid = 1;
+		}
+
+		/* Evaluate the ACK_REQUIRED flag *after* we received
+		 * the complete frame, and after bumping the next_rx
+		 * sequence. */
+		if (hdr->h_flags & RDS_FLAG_ACK_REQUIRED) {
+			rds_stats_inc(s_recv_ack_required);
+			state->ack_required = 1;
+		}
+
+		rds_inc_put(&ibinc->ii_inc);
+	}
+}
+
+/*
+ * Plucking the oldest entry from the ring can be done concurrently with
+ * the thread refilling the ring.  Each ring operation is protected by
+ * spinlocks and the transient state of refilling doesn't change the
+ * recording of which entry is oldest.
+ *
+ * This relies on IB only calling one cq comp_handler for each cq so that
+ * there will only be one caller of rds_recv_incoming() per RDS connection.
+ */
+void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context)
+{
+	struct rds_connection *conn = context;
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct ib_wc wc;
+	struct rds_ib_ack_state state = { 0, };
+	struct rds_ib_recv_work *recv;
+
+	rdsdebug("conn %p cq %p\n", conn, cq);
+
+	rds_ib_stats_inc(s_ib_rx_cq_call);
+
+	ib_req_notify_cq(cq, IB_CQ_SOLICITED);
+
+	while (ib_poll_cq(cq, 1, &wc) > 0) {
+		rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n",
+			 (unsigned long long)wc.wr_id, wc.status, wc.byte_len,
+			 be32_to_cpu(wc.ex.imm_data));
+		rds_ib_stats_inc(s_ib_rx_cq_event);
+
+		recv = &ic->i_recvs[rds_ib_ring_oldest(&ic->i_recv_ring)];
+
+		rds_ib_recv_unmap_page(ic, recv);
+
+		if (rds_conn_up(conn)) {
+			/* We expect errors as the qp is drained during shutdown */
+			if (wc.status == IB_WC_SUCCESS) {
+				rds_ib_process_recv(conn, recv, wc.byte_len, &state);
+			} else {
+				rds_ib_conn_error(conn, "recv completion on "
+				       "%u.%u.%u.%u had status %u, disconnecting and "
+				       "reconnecting\n", NIPQUAD(conn->c_faddr),
+				       wc.status);
+			}
+		}
+
+		rds_ib_ring_free(&ic->i_recv_ring, 1);
+	}
+
+	if (state.ack_next_valid)
+		rds_ib_set_ack(ic, state.ack_next, state.ack_required);
+	if (state.ack_recv_valid && state.ack_recv > ic->i_ack_recv) {
+		rds_send_drop_acked(conn, state.ack_recv, NULL);
+		ic->i_ack_recv = state.ack_recv;
+	}
+	if (rds_conn_up(conn))
+		rds_ib_attempt_ack(ic);
+
+	/* If we ever end up with a really empty receive ring, we're
+	 * in deep trouble, as the sender will definitely see RNR
+	 * timeouts. */
+	if (rds_ib_ring_empty(&ic->i_recv_ring))
+		rds_ib_stats_inc(s_ib_rx_ring_empty);
+
+	/*
+	 * If the ring is running low, then schedule the thread to refill.
+	 */
+	if (rds_ib_ring_low(&ic->i_recv_ring))
+		queue_delayed_work(rds_wq, &conn->c_recv_w, 0);
+}
+
+int rds_ib_recv(struct rds_connection *conn)
+{
+	struct rds_ib_connection *ic = conn->c_transport_data;
+	int ret = 0;
+
+	rdsdebug("conn %p\n", conn);
+
+	/*
+	 * If we get a temporary posting failure in this context then
+	 * we're really low and we want the caller to back off for a bit.
+	 */
+	mutex_lock(&ic->i_recv_mutex);
+	if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 0))
+		ret = -ENOMEM;
+	else
+		rds_ib_stats_inc(s_ib_rx_refill_from_thread);
+	mutex_unlock(&ic->i_recv_mutex);
+
+	return ret;
+}
+
+int __init rds_ib_recv_init(void)
+{
+	struct sysinfo si;
+	int ret = -ENOMEM;
+
+	/* Default to roughly a third of all available RAM for recv memory */
+	si_meminfo(&si);
+	rds_ib_sysctl_max_recv_allocation = si.totalram / 3 * PAGE_SIZE / RDS_FRAG_SIZE;
+
+	rds_ib_incoming_slab = kmem_cache_create("rds_ib_incoming",
+					sizeof(struct rds_ib_incoming),
+					0, 0, NULL);
+	if (rds_ib_incoming_slab == NULL)
+		goto out;
+
+	rds_ib_frag_slab = kmem_cache_create("rds_ib_frag",
+					sizeof(struct rds_page_frag),
+					0, 0, NULL);
+	if (rds_ib_frag_slab == NULL)
+		kmem_cache_destroy(rds_ib_incoming_slab);
+	else
+		ret = 0;
+out:
+	return ret;
+}
+
+void rds_ib_recv_exit(void)
+{
+	kmem_cache_destroy(rds_ib_incoming_slab);
+	kmem_cache_destroy(rds_ib_frag_slab);
+}
-- 
1.5.6.3



* [PATCH 18/21] RDS/IB: Stats and sysctls
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (16 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 17/21] RDS/IB: Receive datagrams via IB Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [PATCH 19/21] RDS: Documentation Andy Grover
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

IB-specific stats and sysctls.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/ulp/rds/ib_stats.c  |   95 ++++++++++++++++++++++
 drivers/infiniband/ulp/rds/ib_sysctl.c |  137 ++++++++++++++++++++++++++++++++
 2 files changed, 232 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/ib_stats.c
 create mode 100644 drivers/infiniband/ulp/rds/ib_sysctl.c

diff --git a/drivers/infiniband/ulp/rds/ib_stats.c b/drivers/infiniband/ulp/rds/ib_stats.c
new file mode 100644
index 0000000..02e3e3d
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_stats.c
@@ -0,0 +1,95 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/percpu.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+#include "rds.h"
+#include "ib.h"
+
+DEFINE_PER_CPU(struct rds_ib_statistics, rds_ib_stats) ____cacheline_aligned;
+
+static char *rds_ib_stat_names[] = {
+	"ib_connect_raced",
+	"ib_listen_closed_stale",
+	"ib_tx_cq_call",
+	"ib_tx_cq_event",
+	"ib_tx_ring_full",
+	"ib_tx_throttle",
+	"ib_tx_sg_mapping_failure",
+	"ib_tx_stalled",
+	"ib_tx_credit_updates",
+	"ib_rx_cq_call",
+	"ib_rx_cq_event",
+	"ib_rx_ring_empty",
+	"ib_rx_refill_from_cq",
+	"ib_rx_refill_from_thread",
+	"ib_rx_alloc_limit",
+	"ib_rx_credit_updates",
+	"ib_ack_sent",
+	"ib_ack_send_failure",
+	"ib_ack_send_delayed",
+	"ib_ack_send_piggybacked",
+	"ib_ack_received",
+	"ib_rdma_mr_alloc",
+	"ib_rdma_mr_free",
+	"ib_rdma_mr_used",
+	"ib_rdma_mr_pool_flush",
+	"ib_rdma_mr_pool_wait",
+	"ib_rdma_mr_pool_depleted",
+};
+
+unsigned int rds_ib_stats_info_copy(struct rds_info_iterator *iter,
+				    unsigned int avail)
+{
+	struct rds_ib_statistics stats = {0, };
+	uint64_t *src;
+	uint64_t *sum;
+	size_t i;
+	int cpu;
+
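+	/* Note: even when avail is too small we still return the number
+	 * of counters below, so the caller can size its buffer and
+	 * retry. */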
+	if (avail < ARRAY_SIZE(rds_ib_stat_names))
+		goto out;
+
+	for_each_online_cpu(cpu) {
+		src = (uint64_t *)&(per_cpu(rds_ib_stats, cpu));
+		sum = (uint64_t *)&stats;
+		for (i = 0; i < sizeof(stats) / sizeof(uint64_t); i++)
+			*(sum++) += *(src++);
+	}
+
+	rds_stats_info_copy(iter, (uint64_t *)&stats, rds_ib_stat_names,
+			    ARRAY_SIZE(rds_ib_stat_names));
+out:
+	return ARRAY_SIZE(rds_ib_stat_names);
+}
diff --git a/drivers/infiniband/ulp/rds/ib_sysctl.c b/drivers/infiniband/ulp/rds/ib_sysctl.c
new file mode 100644
index 0000000..d87830d
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/ib_sysctl.c
@@ -0,0 +1,137 @@
+/*
+ * Copyright (c) 2006 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/sysctl.h>
+#include <linux/proc_fs.h>
+
+#include "ib.h"
+
+static struct ctl_table_header *rds_ib_sysctl_hdr;
+
+unsigned long rds_ib_sysctl_max_send_wr = RDS_IB_DEFAULT_SEND_WR;
+unsigned long rds_ib_sysctl_max_recv_wr = RDS_IB_DEFAULT_RECV_WR;
+unsigned long rds_ib_sysctl_max_recv_allocation = (128 * 1024 * 1024) / RDS_FRAG_SIZE;
+static unsigned long rds_ib_sysctl_max_wr_min = 1;
+/* hardware will fail CQ creation long before this */
+static unsigned long rds_ib_sysctl_max_wr_max = (u32)~0;
+
+unsigned long rds_ib_sysctl_max_unsig_wrs = 16;
+static unsigned long rds_ib_sysctl_max_unsig_wr_min = 1;
+static unsigned long rds_ib_sysctl_max_unsig_wr_max = 64;
+
+unsigned long rds_ib_sysctl_max_unsig_bytes = (16 << 20);
+static unsigned long rds_ib_sysctl_max_unsig_bytes_min = 1;
+static unsigned long rds_ib_sysctl_max_unsig_bytes_max = ~0UL;
+
+unsigned int rds_ib_sysctl_flow_control = 1;
+
+ctl_table rds_ib_sysctl_table[] = {
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_send_wr",
+		.data		= &rds_ib_sysctl_max_send_wr,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_ib_sysctl_max_wr_min,
+		.extra2		= &rds_ib_sysctl_max_wr_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_recv_wr",
+		.data		= &rds_ib_sysctl_max_recv_wr,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_ib_sysctl_max_wr_min,
+		.extra2		= &rds_ib_sysctl_max_wr_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_unsignaled_wr",
+		.data		= &rds_ib_sysctl_max_unsig_wrs,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_ib_sysctl_max_unsig_wr_min,
+		.extra2		= &rds_ib_sysctl_max_unsig_wr_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_unsignaled_bytes",
+		.data		= &rds_ib_sysctl_max_unsig_bytes,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+		.extra1		= &rds_ib_sysctl_max_unsig_bytes_min,
+		.extra2		= &rds_ib_sysctl_max_unsig_bytes_max,
+	},
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "max_recv_allocation",
+		.data		= &rds_ib_sysctl_max_recv_allocation,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "flow_control",
+		.data		= &rds_ib_sysctl_flow_control,
+		.maxlen		= sizeof(rds_ib_sysctl_flow_control),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{ .ctl_name = 0}
+};
+
+static struct ctl_path rds_ib_sysctl_path[] = {
+	{ .procname = "net", .ctl_name = CTL_NET, },
+	{ .procname = "rds", .ctl_name = CTL_UNNUMBERED, },
+	{ .procname = "ib", .ctl_name = CTL_UNNUMBERED, },
+	{ }
+};
+
+void rds_ib_sysctl_exit(void)
+{
+	if (rds_ib_sysctl_hdr)
+		unregister_sysctl_table(rds_ib_sysctl_hdr);
+}
+
+int __init rds_ib_sysctl_init(void)
+{
+	rds_ib_sysctl_hdr = register_sysctl_paths(rds_ib_sysctl_path, rds_ib_sysctl_table);
+	if (rds_ib_sysctl_hdr == NULL)
+		return -ENOMEM;
+	return 0;
+}
-- 
1.5.6.3



* [PATCH 19/21] RDS: Documentation
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (17 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 18/21] RDS/IB: Stats and sysctls Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  2:17 ` [PATCH 20/21] RDS: Kconfig and Makefile Andy Grover
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

This file documents the specifics of the RDS sockets API,
as well as covering some of the details of its internal
implementation.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 Documentation/networking/rds.txt |  356 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 356 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/rds.txt

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
new file mode 100644
index 0000000..c67077c
--- /dev/null
+++ b/Documentation/networking/rds.txt
@@ -0,0 +1,356 @@
+
+Overview
+========
+
+This readme tries to provide some background on the hows and whys of RDS,
+and will hopefully help you find your way around the code.
+
+In addition, please see this email about RDS origins:
+http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
+
+RDS Architecture
+================
+
+RDS provides reliable, ordered datagram delivery by using a single
+reliable connection between any two nodes in the cluster. This allows
+applications to use a single socket to talk to any other process in the
+cluster - so in a cluster with N processes you need N sockets, in contrast
+to N*N if you use a connection-oriented socket transport like TCP.
+
+RDS is not Infiniband-specific; it was designed to support different
+transports.  The current implementation used to support RDS over TCP as well
+as IB.  Work is in progress to support RDS over iWARP, and if DCE can
+guarantee no dropped packets on Ethernet, it may become possible to run RDS
+over UDP in the future.
+
+The high-level semantics of RDS from the application's point of view are
+
+ *	Addressing
+        RDS uses IPv4 addresses and 16bit port numbers to identify
+        the end point of a connection. All socket operations that involve
+        passing addresses between kernel and user space generally
+        use a struct sockaddr_in.
+
+        The fact that IPv4 addresses are used does not mean the underlying
+        transport has to be IP-based. In fact, RDS over IB uses a
+        reliable IB connection; the IP address is used exclusively to
+        locate the remote node's GID (by ARPing for the given IP).
+
+        The port space is entirely independent of UDP, TCP or any other
+        protocol.
+
+ *	Socket interface
+        RDS sockets work *mostly* as you would expect from a BSD
+        socket. The next section will cover the details. At any rate,
+        all I/O is performed through the standard BSD socket API.
+        Some additions like zerocopy support are implemented through
+        control messages, while other extensions use the getsockopt/
+        setsockopt calls.
+
+        Sockets must be bound before you can send or receive data.
+        This is needed because binding also selects a transport and
+        attaches it to the socket. Once bound, the transport assignment
+        does not change. RDS will tolerate IPs moving around (e.g. in
+        an active-active HA scenario), but only as long as the address
+        doesn't move to a different transport.
+
+ *	sysctls
+        RDS supports a number of sysctls in /proc/sys/net/rds
+
+
+Socket Interface
+================
+
+  AF_RDS, PF_RDS, SOL_RDS
+        These constants haven't been assigned yet, because RDS isn't in
+        mainline yet. Currently, the kernel module assigns some constant
+        and publishes it to user space through two sysctl files
+                /proc/sys/net/rds/pf_rds
+                /proc/sys/net/rds/sol_rds
+
+  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
+        This creates a new, unbound RDS socket.
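+
+        Until the constant is assigned, a portable application has to
+        read the value from the sysctl file first. A minimal sketch
+        (illustrative only, error handling omitted):
+
+                int pf_rds;
+                FILE *f;
+
+                f = fopen("/proc/sys/net/rds/pf_rds", "r");
+                fscanf(f, "%d", &pf_rds);
+                fclose(f);
+
+                fd = socket(pf_rds, SOCK_SEQPACKET, 0);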
+
+  setsockopt(SOL_SOCKET): send and receive buffer size
+        RDS honors the send and receive buffer size socket options.
+        You are not allowed to queue more than SO_SNDBUF bytes to
+        a socket. A message is queued when sendmsg is called, and
+        it leaves the queue when the remote system acknowledges
+        its arrival.
+
+        The SO_RCVBUF option controls the maximum receive queue length.
+        This is a soft limit rather than a hard limit - RDS will
+        continue to accept and queue incoming messages, even if that
+        takes the queue length over the limit. However, it will also
+        mark the port as "congested" and send a congestion update to
+        the source node. The source node is supposed to throttle any
+        processes sending to this congested port.
+
+  bind(fd, &sockaddr_in, ...)
+        This binds the socket to a local IP address and port, and a
+        transport.
+
+  sendmsg(fd, ...)
+        Sends a message to the indicated recipient. The kernel will
+        transparently establish the underlying reliable connection
+        if it isn't up yet.
+
+        An attempt to send a message that exceeds SO_SNDBUF will
+        return with -EMSGSIZE.
+
+        An attempt to send a message that would take the total number
+        of queued bytes over the SO_SNDBUF threshold will return
+        -EAGAIN.
+
+        An attempt to send a message to a destination that is marked
+        as "congested" will return ENOBUFS.
+
+  recvmsg(fd, ...)
+        Receives a message that was queued to this socket. The socket's
+        recv queue accounting is adjusted, and if the queue length
+        drops below SO_RCVBUF, the port is marked uncongested, and
+        a congestion update is sent to all peers.
+
+        Applications can ask the RDS kernel module to receive
+        notifications via control messages (for instance, there is a
+        notification when a congestion update arrived, or when a RDMA
+        operation completes). These notifications are received through
+        the msg.msg_control buffer of struct msghdr. The format of the
+        messages is described in manpages.
+
+  poll(fd)
+        RDS supports the poll interface to allow the application
+        to implement async I/O.
+
+        POLLIN handling is pretty straightforward. When there's an
+        incoming message queued to the socket, or a pending notification,
+        we signal POLLIN.
+
+        POLLOUT is a little harder. Since you can essentially send
+        to any destination, RDS will always signal POLLOUT as long as
+        there's room on the send queue (ie the number of bytes queued
+        is less than the sendbuf size).
+
+        However, the kernel will refuse to accept messages to
+        a destination marked congested - in this case you will loop
+        forever if you rely on poll to tell you what to do.
+        This isn't a trivial problem, but applications can deal with
+        this - by using congestion notifications, and by checking for
+        ENOBUFS errors returned by sendmsg.
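+
+        A hedged sketch of one way to deal with this in a send loop
+        (illustrative only; see the rds(7) manpage for the congestion
+        notification details):
+
+                while (sendmsg(fd, &msg, 0) < 0 && errno == ENOBUFS) {
+                        /* Destination congested: POLLOUT won't clear.
+                         * Wait for a congestion update (POLLIN) and
+                         * retry the send. */
+                        struct pollfd pfd = { .fd = fd, .events = POLLIN };
+                        poll(&pfd, 1, -1);
+                }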
+
+  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
+        This allows the application to discard all messages queued to a
+        specific destination on this particular socket.
+
+        This allows the application to cancel outstanding messages if
+        it detects a timeout. For instance, if it tried to send a message,
+        and the remote host is unreachable, RDS will keep trying forever.
+        The application may decide it's not worth it, and cancel the
+        operation. In this case, it would use RDS_CANCEL_SENT_TO to
+        nuke any pending messages.
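+
+  Putting the calls above together, a minimal sketch of a bound socket
+  sending one datagram and waiting for a reply (illustrative only;
+  addresses, ports and buffer sizes are made up, and error checking
+  is omitted):
+
+        char buf[1024];
+        struct sockaddr_in sin = { 0 };
+        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+        struct msghdr msg = { 0 };
+
+        /* bind to a local address; this also selects the transport */
+        sin.sin_family = AF_INET;
+        sin.sin_addr.s_addr = inet_addr("10.0.0.1");
+        sin.sin_port = htons(4000);
+        bind(fd, (struct sockaddr *) &sin, sizeof(sin));
+
+        /* one sendmsg() per datagram; the kernel brings the
+         * underlying connection up transparently */
+        sin.sin_addr.s_addr = inet_addr("10.0.0.2");
+        msg.msg_name = &sin;
+        msg.msg_namelen = sizeof(sin);
+        msg.msg_iov = &iov;
+        msg.msg_iovlen = 1;
+        sendmsg(fd, &msg, 0);
+
+        /* block until a datagram arrives on our bound port */
+        recvmsg(fd, &msg, 0);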
+
+
+RDMA for RDS
+============
+
+  see rds-rdma(7) manpage (available in rds-tools)
+
+
+Congestion Notifications
+========================
+
+  see rds(7) manpage
+
+
+RDS Protocol
+============
+
+  Message header
+
+    The message header is a 'struct rds_header' (see rds.h):
+    Fields:
+      h_sequence:
+          per-packet sequence number
+      h_ack:
+          piggybacked acknowledgment of last packet received
+      h_len:
+          length of data, not including header
+      h_sport:
+          source port
+      h_dport:
+          destination port
+      h_flags:
+          CONG_BITMAP - this is a congestion update bitmap
+          ACK_REQUIRED - receiver must ack this packet
+          RETRANSMITTED - packet has previously been sent
+      h_credit:
+          indicate to other end of connection that
+          it has more credits available (i.e. there is
+          more send room)
+      h_padding[4]:
+          unused, for future use
+      h_csum:
+          header checksum
+      h_exthdr:
+          optional data can be passed here. This is currently used for
+          passing RDMA-related information.
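+
+    For reference, the layout sketched as a C struct (see rds.h in
+    this series for the authoritative definition; the field widths
+    below are inferred from the descriptions above and may not match
+    exactly):
+
+        struct rds_header {
+                __be64  h_sequence;
+                __be64  h_ack;
+                __be32  h_len;
+                __be16  h_sport;
+                __be16  h_dport;
+                u8      h_flags;
+                u8      h_credit;
+                u8      h_padding[4];
+                __sum16 h_csum;
+                u8      h_exthdr[16];   /* optional extension headers */
+        };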
+
+  ACK and retransmit handling
+
+      One might think that with reliable IB connections you wouldn't need
+      to ack messages that have been received.  The problem is that IB
+      hardware generates an ack message before it has DMAed the message
+      into memory.  This creates a potential message loss if the HCA is
+      disabled for any reason between when it sends the ack and when
+      the message is DMAed and processed.  This is only a potential issue
+      if another HCA is available for fail-over.
+
+      Sending an ack immediately would allow the sender to free the sent
+      message from its send queue quickly, but could cause excessive
+      traffic to be used for acks. RDS piggybacks acks on sent data
+      packets.  Ack-only packets are reduced by only allowing one to be
+      in flight at a time, and by the sender only asking for acks when
+      its send buffers start to fill up. All retransmissions are also
+      acked.
+
+  Flow Control
+
+      RDS's IB transport uses a credit-based mechanism to verify that
+      there is space in the peer's receive buffers for more data. This
+      eliminates the need for hardware retries on the connection.
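+
+      A simplified sketch of the idea (illustrative names; the real
+      logic lives in rds_ib_send_grab_credits() and
+      rds_ib_send_add_credits() in ib_send.c):
+
+          static atomic_t peer_credits;  /* free recv buffers at peer */
+
+          /* take one credit before posting a send; 0 means throttle */
+          static int credit_grab(void)
+          {
+                  int old;
+
+                  do {
+                          old = atomic_read(&peer_credits);
+                          if (old == 0)
+                                  return 0;
+                  } while (atomic_cmpxchg(&peer_credits, old,
+                                          old - 1) != old);
+                  return 1;
+          }
+
+          /* return credits advertised by an incoming h_credit field */
+          static void credit_add(unsigned int credits)
+          {
+                  atomic_add(credits, &peer_credits);
+          }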
+
+  Congestion
+
+      Messages waiting in the receive queue on the receiving socket
+      are accounted against the socket's SO_RCVBUF option value.  Only
+      the payload bytes in the message are accounted for.  If the
+      number of bytes queued equals or exceeds rcvbuf then the socket
+      is congested.  All sends attempted to this socket's address
+      should block or return -EWOULDBLOCK.
+
+      Applications are expected to be reasonably tuned such that this
+      situation very rarely occurs.  An application encountering this
+      "back-pressure" is considered a bug.
+
+      This is implemented by having each node maintain bitmaps which
+      indicate which ports on bound addresses are congested.  As the
+      bitmap changes it is sent through all the connections which
+      terminate in the local address of the bitmap which changed.
+
+      The bitmaps are allocated as connections are brought up.  This
+      avoids allocation in the interrupt handling path which queues
+      messages on sockets.  The dense bitmaps let transports send the
+      entire bitmap on any bitmap change reasonably efficiently.  This
+      is much easier to implement than some finer-grained
+      communication of per-port congestion.  The sender does a very
+      inexpensive bit test to test if the port it's about to send to
+      is congested or not.
+
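+      The sender-side check might look roughly like this (hypothetical
+      helper; the real code is in cong.c):
+
+          /* One bit per 16-bit port, stored little-endian across the
+           * map's pages (m_page_addrs as used by rds_ib_cong_recv). */
+          static int port_is_congested(struct rds_cong_map *map, u16 port)
+          {
+                  unsigned long bits_per_page = PAGE_SIZE * 8;
+                  u8 *page = (u8 *) map->m_page_addrs[port / bits_per_page];
+                  unsigned long bit = port % bits_per_page;
+
+                  return page[bit / 8] & (1 << (bit % 8));
+          }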
+
+RDS Transport Layer
+===================
+
+  As mentioned above, RDS is not IB-specific. Its code is divided
+  into a general RDS layer and a transport layer.
+
+  The general layer handles the socket API, congestion handling,
+  loopback, stats, usermem pinning, and the connection state machine.
+
+  The transport layer handles the details of the transport. The IB
+  transport, for example, handles all the queue pairs, work requests,
+  CM event handlers, and other Infiniband details.
+
+
+RDS Kernel Structures
+=====================
+
+  struct rds_message
+    aka possibly "rds_outgoing", the generic RDS layer copies data to
+    be sent and sets header fields as needed, based on the socket API.
+    This is then queued for the individual connection and sent by the
+    connection's transport.
+  struct rds_incoming
+    a generic struct referring to incoming data that can be handed from
+    the transport to the general code and queued by the general code
+    while the socket is awoken. It is then passed back to the transport
+    code to handle the actual copy-to-user.
+  struct rds_socket
+    per-socket information
+  struct rds_connection
+    per-connection information
+  struct rds_transport
+    pointers to transport-specific functions
+  struct rds_statistics
+    non-transport-specific statistics
+  struct rds_cong_map
+    wraps the raw congestion bitmap, contains rbnode, waitq, etc.
+
+Connection management
+=====================
+
+  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
+  ERROR states.
+
+  The first time an attempt is made by an RDS socket to send data to
+  a node, a connection is allocated and connected. That connection is
+  then maintained forever -- if there are transport errors, the
+  connection will be dropped and re-established.
+
+  Dropping a connection while packets are queued will cause queued or
+  partially-sent datagrams to be retransmitted when the connection is
+  re-established.
+
+
+The send path
+=============
+
+  rds_sendmsg()
+    struct rds_message built from incoming data
+    CMSGs parsed (e.g. RDMA ops)
+    transport connection alloced and connected if not already
+    rds_message placed on send queue
+    send worker awoken
+  rds_send_worker()
+    calls rds_send_xmit() until queue is empty
+  rds_send_xmit()
+    transmits congestion map if one is pending
+    may set ACK_REQUIRED
+    calls transport to send either non-RDMA or RDMA message
+    (RDMA ops never retransmitted)
+  rds_ib_xmit()
+    allocs work requests from send ring
+    adds any new send credits available to peer (h_credits)
+    maps the rds_message's sg list
+    piggybacks ack
+    populates work requests
+    post send to connection's queue pair
+
+The recv path
+=============
+
+  rds_ib_recv_cq_comp_handler()
+    looks at recv work completions
+    unmaps recv buffer from device
+    no errors, call rds_ib_process_recv()
+    refill recv ring
+  rds_ib_process_recv()
+    validate header checksum
+    copy header to rds_ib_incoming struct if start of a new datagram
+    add to ibinc's fraglist
+    if completed datagram:
+      update cong map if datagram was cong update
+      call rds_recv_incoming() otherwise
+      note if ack is required
+  rds_recv_incoming()
+    drop duplicate packets
+    respond to pings
+    find the sock associated with this datagram
+    add to sock queue
+    wake up sock
+    do some congestion calculations
+  rds_recvmsg
+    copy data into user iovec
+    handle CMSGs
+    return to application
+
+
-- 
1.5.6.3



* [PATCH 20/21] RDS: Kconfig and Makefile
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (18 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 19/21] RDS: Documentation Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-28 22:59   ` Roland Dreier
  2009-01-27  2:17 ` [PATCH 21/21] RDS: Add AF and PF #defines for RDS sockets Andy Grover
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

Add RDS Kconfig and Makefile, and modify infiniband's to add
us to the build.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 drivers/infiniband/Kconfig          |    2 ++
 drivers/infiniband/Makefile         |    1 +
 drivers/infiniband/ulp/rds/Kconfig  |   13 +++++++++++++
 drivers/infiniband/ulp/rds/Makefile |   13 +++++++++++++
 4 files changed, 29 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/rds/Kconfig
 create mode 100644 drivers/infiniband/ulp/rds/Makefile

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index dd0db67..1cba524 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -54,4 +54,6 @@ source "drivers/infiniband/ulp/srp/Kconfig"
 
 source "drivers/infiniband/ulp/iser/Kconfig"
 
+source "drivers/infiniband/ulp/rds/Kconfig"
+
 endif # INFINIBAND
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index ed35e44..39d0203 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_INFINIBAND_NES)		+= hw/nes/
 obj-$(CONFIG_INFINIBAND_IPOIB)		+= ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP)		+= ulp/srp/
 obj-$(CONFIG_INFINIBAND_ISER)		+= ulp/iser/
+obj-$(CONFIG_INFINIBAND_RDS)		+= ulp/rds/
diff --git a/drivers/infiniband/ulp/rds/Kconfig b/drivers/infiniband/ulp/rds/Kconfig
new file mode 100644
index 0000000..bbc2ba4
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/Kconfig
@@ -0,0 +1,13 @@
+
+config INFINIBAND_RDS
+	tristate "Reliable Datagram Sockets (RDS) (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	---help---
+	  RDS provides reliable, sequenced delivery of datagrams
+	  over Infiniband.
+
+config INFINIBAND_RDS_DEBUG
+        bool "Debugging messages"
+	depends on INFINIBAND_RDS
+        default n
+
diff --git a/drivers/infiniband/ulp/rds/Makefile b/drivers/infiniband/ulp/rds/Makefile
new file mode 100644
index 0000000..d470550
--- /dev/null
+++ b/drivers/infiniband/ulp/rds/Makefile
@@ -0,0 +1,13 @@
+obj-$(CONFIG_INFINIBAND_RDS) += ib_rds.o
+
+ib_rds-y :=	af_rds.o bind.o cong.o connection.o info.o message.o   \
+			recv.o send.o stats.o sysctl.o threads.o transport.o \
+			loop.o page.o rdma.o
+
+ib_rds-y += ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
+			ib_sysctl.o ib_rdma.o
+
+ifeq ($(CONFIG_INFINIBAND_RDS_DEBUG), y)
+EXTRA_CFLAGS += -DDEBUG
+endif
+
-- 
1.5.6.3



* [PATCH 21/21] RDS: Add AF and PF #defines for RDS sockets
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (19 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 20/21] RDS: Kconfig and Makefile Andy Grover
@ 2009-01-27  2:17 ` Andy Grover
  2009-01-27  7:27   ` Rémi Denis-Courmont
  2009-01-27 15:34 ` [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Steve Wise
  2009-01-28 22:37 ` Roland Dreier
  22 siblings, 1 reply; 64+ messages in thread
From: Andy Grover @ 2009-01-27  2:17 UTC (permalink / raw)
  To: rdreier; +Cc: rds-devel, general, netdev

RDS is a reliable datagram protocol used for IPC on Oracle
database clusters. This adds address and protocol family numbers
for it.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
---
 include/linux/socket.h |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 20fc4bb..fda91af 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -191,7 +191,8 @@ struct ucred {
 #define AF_RXRPC	33	/* RxRPC sockets 		*/
 #define AF_ISDN		34	/* mISDN sockets 		*/
 #define AF_PHONET	35	/* Phonet sockets		*/
-#define AF_MAX		36	/* For now.. */
+#define AF_RDS		36	/* RDS sockets 			*/
+#define AF_MAX		37	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -229,6 +230,7 @@ struct ucred {
 #define PF_RXRPC	AF_RXRPC
 #define PF_ISDN		AF_ISDN
 #define PF_PHONET	AF_PHONET
+#define PF_RDS		AF_RDS
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
-- 
1.5.6.3



* Re: [PATCH 01/21] RDS: Socket interface
  2009-01-27  2:17 ` [ofa-general] [PATCH 01/21] RDS: Socket interface Andy Grover
@ 2009-01-27  3:46   ` Stephen Hemminger
  2009-01-29  3:17     ` [ofa-general] " Andrew Grover
  2009-01-27  4:11   ` David Miller
  2009-01-27 12:08   ` Evgeniy Polyakov
  2 siblings, 1 reply; 64+ messages in thread
From: Stephen Hemminger @ 2009-01-27  3:46 UTC (permalink / raw)
  To: Andy Grover; +Cc: rdreier, rds-devel, general, netdev

On Mon, 26 Jan 2009 18:17:38 -0800
Andy Grover <andy.grover@oracle.com> wrote:

> Implement the RDS (Reliable Datagram Sockets) interface.
> 
> Signed-off-by: Andy Grover <andy.grover@oracle.com>
> ---
>  drivers/infiniband/ulp/rds/af_rds.c |  677 +++++++++++++++++++++++++++++++++++
>  drivers/infiniband/ulp/rds/bind.c   |  202 +++++++++++
>  2 files changed, 879 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/infiniband/ulp/rds/af_rds.c
>  create mode 100644 drivers/infiniband/ulp/rds/bind.c
> 
> diff --git a/drivers/infiniband/ulp/rds/af_rds.c b/drivers/infiniband/ulp/rds/af_rds.c
> new file mode 100644
> index 0000000..7158438
> --- /dev/null
> +++ b/drivers/infiniband/ulp/rds/af_rds.c
> @@ -0,0 +1,677 @@
> +/*
> + * Copyright (c) 2006 Oracle.  All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +#include <linux/module.h>
> +#include <linux/errno.h>
> +#include <linux/kernel.h>
> +#include <linux/in.h>
> +#include <linux/poll.h>
> +#include <linux/version.h>
> +#include <net/sock.h>
> +
> +#include "rds.h"
> +#include "rdma.h"
> +
> +static int enable_rdma = 1;
> +
> +module_param(enable_rdma, int, 0444);
> +MODULE_PARM_DESC(enable_rdma, " Enable RDMA operations support");

Module parameter is a cop-out for not doing it right.

> +/* this is just used for stats gathering :/ */

Then I would think a high speed protocol would use per-cpu
and/or rcu.

> +static DEFINE_SPINLOCK(rds_sock_lock);
> +static unsigned long rds_sock_count;
> +static LIST_HEAD(rds_sock_list);
> +DECLARE_WAIT_QUEUE_HEAD(rds_poll_waitq);
> +
> +/*
> + * This is called as the final descriptor referencing this socket is closed.
> + * We have to unbind the socket so that another socket can be bound to the
> + * address it was using.
> + *
> + * We have to be careful about racing with the incoming path.  sock_orphan()
> + * sets SOCK_DEAD and we use that as an indicator to the rx path that new
> + * messages shouldn't be queued.
> + */
> +static int rds_release(struct socket *sock)
> +{
> +	struct sock *sk = sock->sk;
> +	struct rds_sock *rs;
> +	unsigned long flags;
> +
> +	if (sk == NULL)
> +		goto out;
> +
> +	rs = rds_sk_to_rs(sk);
> +
> +	sock_orphan(sk);
> +	/* Note - rds_clear_recv_queue grabs rs_recv_lock, so
> +	 * that ensures the recv path has completed messing
> +	 * with the socket. */
> +	rds_clear_recv_queue(rs);
> +	rds_cong_remove_socket(rs);
> +	rds_remove_bound(rs);
> +	rds_send_drop_to(rs, NULL);
> +	rds_rdma_drop_keys(rs);
> +	rds_notify_queue_get(rs, NULL);
> +
> +	spin_lock_irqsave(&rds_sock_lock, flags);
> +	list_del_init(&rs->rs_item);
> +	rds_sock_count--;
> +	spin_unlock_irqrestore(&rds_sock_lock, flags);
> +
> +	sock->sk = NULL;
> +	sock_put(sk);
> +out:
> +	return 0;
> +}
> +
> +/*
> + * Careful not to race with rds_release -> sock_orphan which clears sk_sleep.
> + * _bh() isn't OK here, we're called from interrupt handlers.  It's probably OK
> + * to wake the waitqueue after sk_sleep is clear as we hold a sock ref, but
> + * this seems more conservative.
> + * NB - normally, one would use sk_callback_lock for this, but we can
> + * get here from interrupts, whereas the network code grabs sk_callback_lock
> + * with _lock_bh only - so relying on sk_callback_lock introduces livelocks.
> + */
> +void rds_wake_sk_sleep(struct rds_sock *rs)
> +{
> +	unsigned long flags;
> +
> +	read_lock_irqsave(&rs->rs_recv_lock, flags);
> +	__rds_wake_sk_sleep(rds_rs_to_sk(rs));
> +	read_unlock_irqrestore(&rs->rs_recv_lock, flags);
> +}
> +
> +static int rds_getname(struct socket *sock, struct sockaddr *uaddr,
> +		       int *uaddr_len, int peer)
> +{
> +	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
> +	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
> +
> +	memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
> +
> +	/* racey, don't care */
> +	if (peer) {
> +		if (!rs->rs_conn_addr)
> +			return -ENOTCONN;
> +
> +		sin->sin_port = rs->rs_conn_port;
> +		sin->sin_addr.s_addr = rs->rs_conn_addr;
> +	} else {
> +		sin->sin_port = rs->rs_bound_port;
> +		sin->sin_addr.s_addr = rs->rs_bound_addr;
> +	}
> +
> +	sin->sin_family = AF_INET;
> +
> +	*uaddr_len = sizeof(*sin);
> +	return 0;
> +}
> +
> +/*
> + * RDS' poll is without a doubt the least intuitive part of the interface,
> + * as POLLIN and POLLOUT do not behave entirely as you would expect from
> + * a network protocol.
> + *
> + * POLLIN is asserted if
> + *  -	there is data on the receive queue.
> + *  -	to signal that a previously congested destination may have become
> + *	uncongested
> + *  -	A notification has been queued to the socket (this can be a congestion
> + *	update, or a RDMA completion).
> + *
> + * POLLOUT is asserted if there is room on the send queue. This does not mean
> + * however, that the next sendmsg() call will succeed. If the application tries
> + * to send to a congested destination, the system call may still fail (and
> + * return ENOBUFS).
> + */
> +static unsigned int rds_poll(struct file *file, struct socket *sock,
> +			     poll_table *wait)
> +{
> +	struct sock *sk = sock->sk;
> +	struct rds_sock *rs = rds_sk_to_rs(sk);
> +	unsigned int mask = 0;
> +	unsigned long flags;
> +
> +	poll_wait(file, sk->sk_sleep, wait);
> +
> +	poll_wait(file, &rds_poll_waitq, wait);
> +
> +	read_lock_irqsave(&rs->rs_recv_lock, flags);
> +	if (!rs->rs_cong_monitor) {
> +		/* When a congestion map was updated, we signal POLLIN for
> +		 * "historical" reasons. Applications can also poll for
> +		 * WRBAND instead. */
> +		if (rds_cong_updated_since(&rs->rs_cong_track))
> +			mask |= (POLLIN | POLLRDNORM | POLLWRBAND);
> +	} else {
> +		spin_lock(&rs->rs_lock);
> +		if (rs->rs_cong_notify)
> +			mask |= (POLLIN | POLLRDNORM);
> +		spin_unlock(&rs->rs_lock);
> +	}
> +	if (!list_empty(&rs->rs_recv_queue)
> +	 || !list_empty(&rs->rs_notify_queue))
> +		mask |= (POLLIN | POLLRDNORM);
> +	if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
> +		mask |= (POLLOUT | POLLWRNORM);
> +	read_unlock_irqrestore(&rs->rs_recv_lock, flags);
> +
> +	return mask;
> +}
> +
> +static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
> +{
> +#ifdef KERNEL_HAS_CORE_CALLING_DEV_IOCTL
> +	return -ENOIOCTLCMD;
> +#endif /* KERNEL_HAS_CORE_CALLING_DEV_IOCTL */
> +#ifndef KERNEL_HAS_CORE_CALLING_DEV_IOCTL
> +	return dev_ioctl(cmd, (void __user *)arg);
> +#endif /* KERNEL_HAS_CORE_CALLING_DEV_IOCTL */
> +}
> +
> +static int rds_cancel_sent_to(struct rds_sock *rs, char __user *optval,
> +			      int len)
> +{
> +	struct sockaddr_in sin;
> +	int ret = 0;
> +
> +	/* racing with another thread binding seems ok here */
> +	if (rs->rs_bound_addr == 0) {
> +		ret = -ENOTCONN; /* XXX not a great errno */
> +		goto out;
> +	}
> +
> +	if (len < sizeof(struct sockaddr_in)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (copy_from_user(&sin, optval, sizeof(sin))) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	rds_send_drop_to(rs, &sin);
> +out:
> +	return ret;
> +}
> +
> +static int rds_set_bool_option(unsigned char *optvar, char __user *optval,
> +			       int optlen)
> +{
> +	int value;
> +
> +	if (optlen < sizeof(int))
> +		return -EINVAL;
> +	if (get_user(value, (int __user *) optval))
> +		return -EFAULT;
> +	*optvar = !!value;
> +	return 0;
> +}
> +
> +static int rds_cong_monitor(struct rds_sock *rs, char __user *optval,
> +			    int optlen)
> +{
> +	int ret;
> +
> +	ret = rds_set_bool_option(&rs->rs_cong_monitor, optval, optlen);
> +	if (ret == 0) {
> +		if (rs->rs_cong_monitor) {
> +			rds_cong_add_socket(rs);
> +		} else {
> +			rds_cong_remove_socket(rs);
> +			rs->rs_cong_mask = 0;
> +			rs->rs_cong_notify = 0;
> +		}
> +	}
> +	return ret;
> +}
> +
> +static int rds_setsockopt(struct socket *sock, int level, int optname,
> +			  char __user *optval, int optlen)
> +{
> +	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
> +	int ret;
> +
> +	if (level != SOL_RDS) {
> +		ret = -ENOPROTOOPT;
> +		goto out;
> +	}
> +
> +	switch (optname) {
> +	case RDS_CANCEL_SENT_TO:
> +		ret = rds_cancel_sent_to(rs, optval, optlen);
> +		break;
> +	case RDS_GET_MR:
> +		if (enable_rdma)
> +			ret = rds_get_mr(rs, optval, optlen);
> +		else
> +			ret = -EOPNOTSUPP;
> +		break;
> +	case RDS_FREE_MR:
> +		if (enable_rdma)
> +			ret = rds_free_mr(rs, optval, optlen);
> +		else
> +			ret = -EOPNOTSUPP;
> +		break;
> +	case RDS_RECVERR:
> +		ret = rds_set_bool_option(&rs->rs_recverr, optval, optlen);
> +		break;
> +	case RDS_CONG_MONITOR:
> +		ret = rds_cong_monitor(rs, optval, optlen);
> +		break;
> +	default:
> +		ret = -ENOPROTOOPT;
> +	}
> +out:
> +	return ret;
> +}
> +
> +static int rds_getsockopt(struct socket *sock, int level, int optname,
> +			  char __user *optval, int __user *optlen)
> +{
> +	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
> +	int ret = -ENOPROTOOPT, len;
> +
> +	if (level != SOL_RDS)
> +		goto out;
> +
> +	if (get_user(len, optlen)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	switch (optname) {
> +	case RDS_INFO_FIRST ... RDS_INFO_LAST:
> +		ret = rds_info_getsockopt(sock, optname, optval,
> +					  optlen);
> +		break;
> +
> +	case RDS_RECVERR:
> +		if (len < sizeof(int))
> +			ret = -EINVAL;
> +		else
> +		if (put_user(rs->rs_recverr, (int __user *) optval)
> +		 || put_user(sizeof(int), optlen))
> +			ret = -EFAULT;
> +		else
> +			ret = 0;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +out:
> +	return ret;
> +
> +}
> +
> +static int rds_connect(struct socket *sock, struct sockaddr *uaddr,
> +		       int addr_len, int flags)
> +{
> +	struct sock *sk = sock->sk;
> +	struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
> +	struct rds_sock *rs = rds_sk_to_rs(sk);
> +	int ret = 0;
> +
> +	lock_sock(sk);
> +
> +	if (addr_len != sizeof(struct sockaddr_in)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (sin->sin_family != AF_INET) {
> +		ret = -EAFNOSUPPORT;
> +		goto out;
> +	}
> +
> +	if (sin->sin_addr.s_addr == htonl(INADDR_ANY)) {
> +		ret = -EDESTADDRREQ;
> +		goto out;
> +	}
> +
> +	rs->rs_conn_addr = sin->sin_addr.s_addr;
> +	rs->rs_conn_port = sin->sin_port;
> +
> +out:
> +	release_sock(sk);
> +	return ret;
> +}
> +
> +#ifdef KERNEL_HAS_PROTO_REGISTER
> +static struct proto rds_proto = {
> +	.name	  = "RDS",
> +	.owner	  = THIS_MODULE,
> +	.obj_size = sizeof(struct rds_sock),
> +};
> +#endif /* KERNEL_HAS_PROTO_REGISTER */
> +
> +static struct proto_ops rds_proto_ops = {
> +	.family =	AF_RDS,
> +	.owner =	THIS_MODULE,
> +	.release =	rds_release,
> +	.bind =		rds_bind,
> +	.connect =	rds_connect,
> +	.socketpair =	sock_no_socketpair,
> +	.accept =	sock_no_accept,
> +	.getname =	rds_getname,
> +	.poll =		rds_poll,
> +	.ioctl =	rds_ioctl,
> +	.listen =	sock_no_listen,
> +	.shutdown =	sock_no_shutdown,
> +	.setsockopt =	rds_setsockopt,
> +	.getsockopt =	rds_getsockopt,
> +	.sendmsg =	rds_sendmsg,
> +	.recvmsg =	rds_recvmsg,
> +	.mmap =		sock_no_mmap,
> +	.sendpage =	sock_no_sendpage,
> +};
> +
> +#ifndef KERNEL_HAS_PROTO_REGISTER
> +static struct sock *sk_alloc_compat(int pf, gfp_t gfp, struct proto *prot,
> +				    int zero_it)
> +{
> +	struct sock *sk;
> +	struct rds_sock *rs;
> +
> +	sk = sk_alloc(pf, gfp, prot, zero_it);
> +	if (sk == NULL)
> +		return NULL;
> +
> +	rs = kcalloc(1, sizeof(struct rds_sock), GFP_ATOMIC);
> +	if (rs == NULL) {
> +		sk_free(sk);
> +		return NULL;
> +	}
> +
> +	/* sock_def_destruct frees this for us */
> +	sk->sk_protinfo = rs;
> +	rs->rs_sk = sk;
> +
> +	return sk;
> +}
> +
> +#undef sk_alloc
> +#define sk_alloc sk_alloc_compat
> +#endif /* KERNEL_HAS_PROTO_REGISTER */
> +
> +static int __rds_create(struct socket *sock, struct sock *sk, int protocol)
> +{
> +	unsigned long flags;
> +	struct rds_sock *rs;
> +
> +	sock_init_data(sock, sk);
> +#ifndef KERNEL_HAS_PROTO_REGISTER
> +	/* Can this be moved to sk_alloc_compat? */
> +	sk_set_owner(sk, THIS_MODULE);
> +#endif /* KERNEL_HAS_PROTO_REGISTER */
> +	sock->ops		= &rds_proto_ops;
> +	sk->sk_protocol		= protocol;
> +
> +	rs = rds_sk_to_rs(sk);
> +	spin_lock_init(&rs->rs_lock);
> +	rwlock_init(&rs->rs_recv_lock);
> +	INIT_LIST_HEAD(&rs->rs_send_queue);
> +	INIT_LIST_HEAD(&rs->rs_recv_queue);
> +	INIT_LIST_HEAD(&rs->rs_notify_queue);
> +	INIT_LIST_HEAD(&rs->rs_cong_list);
> +	spin_lock_init(&rs->rs_rdma_lock);
> +	rs->rs_rdma_keys = RB_ROOT;
> +
> +	spin_lock_irqsave(&rds_sock_lock, flags);
> +	list_add_tail(&rs->rs_item, &rds_sock_list);
> +	rds_sock_count++;
> +	spin_unlock_irqrestore(&rds_sock_lock, flags);
> +
> +	return 0;
> +}
> +
> +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)
> +static int rds_create(struct socket *sock, int protocol)
> +{
> +	struct sock *sk;
> +
> +	if (sock->type != SOCK_SEQPACKET || protocol)
> +		return -ESOCKTNOSUPPORT;
> +
> +	sk = sk_alloc(AF_RDS, GFP_ATOMIC, &rds_proto, 1);
> +	if (sk == NULL)
> +		return -ENOMEM;
> +
> +	return __rds_create(sock, sk, protocol);
> +}
> +#else
> +static int rds_create(struct net *net, struct socket *sock, int protocol)
> +{
> +	struct sock *sk;
> +
> +	if (sock->type != SOCK_SEQPACKET || protocol)
> +		return -ESOCKTNOSUPPORT;
> +
> +	sk = sk_alloc(net, AF_RDS, GFP_ATOMIC, &rds_proto);
> +	if (!sk)
> +		return -ENOMEM;
> +
> +	return __rds_create(sock, sk, protocol);
> +}
> +#endif
> +
> +void rds_sock_addref(struct rds_sock *rs)
> +{
> +	sock_hold(rds_rs_to_sk(rs));
> +}
> +
> +void rds_sock_put(struct rds_sock *rs)
> +{
> +	sock_put(rds_rs_to_sk(rs));
> +}
> +
> +static struct net_proto_family rds_family_ops = {
> +	.family =	AF_RDS,
> +	.create =	rds_create,
> +	.owner	=	THIS_MODULE,
> +};
> +
> +static void rds_sock_inc_info(struct socket *sock, unsigned int len,
> +			      struct rds_info_iterator *iter,
> +			      struct rds_info_lengths *lens)
> +{
> +	struct rds_sock *rs;
> +	struct sock *sk;
> +	struct rds_incoming *inc;
> +	unsigned long flags;
> +	unsigned int total = 0;
> +
> +	len /= sizeof(struct rds_info_message);
> +
> +	spin_lock_irqsave(&rds_sock_lock, flags);
> +
> +	list_for_each_entry(rs, &rds_sock_list, rs_item) {
> +		sk = rds_rs_to_sk(rs);
> +		read_lock(&rs->rs_recv_lock);
> +
> +		/* XXX too lazy to maintain counts.. */
> +		list_for_each_entry(inc, &rs->rs_recv_queue, i_item) {
> +			total++;
> +			if (total <= len)
> +				rds_inc_info_copy(inc, iter, inc->i_saddr,
> +						  rs->rs_bound_addr, 1);
> +		}
> +
> +		read_unlock(&rs->rs_recv_lock);
> +	}
> +
> +	spin_unlock_irqrestore(&rds_sock_lock, flags);
> +
> +	lens->nr = total;
> +	lens->each = sizeof(struct rds_info_message);
> +}
> +
> +static void rds_sock_info(struct socket *sock, unsigned int len,
> +			  struct rds_info_iterator *iter,
> +			  struct rds_info_lengths *lens)
> +{
> +	struct rds_info_socket sinfo;
> +	struct rds_sock *rs;
> +	unsigned long flags;
> +
> +	len /= sizeof(struct rds_info_socket);
> +
> +	spin_lock_irqsave(&rds_sock_lock, flags);
> +
> +	if (len < rds_sock_count)
> +		goto out;
> +
> +	list_for_each_entry(rs, &rds_sock_list, rs_item) {
> +		sinfo.sndbuf = rds_sk_sndbuf(rs);
> +		sinfo.rcvbuf = rds_sk_rcvbuf(rs);
> +		sinfo.bound_addr = rs->rs_bound_addr;
> +		sinfo.connected_addr = rs->rs_conn_addr;
> +		sinfo.bound_port = rs->rs_bound_port;
> +		sinfo.connected_port = rs->rs_conn_port;
> +		sinfo.inum = sock_i_ino(rds_rs_to_sk(rs));
> +
> +		rds_info_copy(iter, &sinfo, sizeof(sinfo));
> +	}
> +
> +out:
> +	lens->nr = rds_sock_count;
> +	lens->each = sizeof(struct rds_info_socket);
> +
> +	spin_unlock_irqrestore(&rds_sock_lock, flags);
> +}
> +
> +/*
> + * The order is important here.
> + *
> + * rds_trans_stop_listening() is called before conn_exit so new connections
> + * don't hit while existing ones are being torn down.
> + *
> + * rds_conn_exit() is before rds_trans_exit() as rds_conn_exit() calls into the
> + * transports to free connections and incoming fragments as they're torn down.
> + */
> +static void __exit rds_exit(void)
> +{
> +	rds_ib_exit();
> +	sock_unregister(rds_family_ops.family);
> +#ifdef KERNEL_HAS_PROTO_REGISTER
> +	proto_unregister(&rds_proto);
> +#endif /* KERNEL_HAS_PROTO_REGISTER */
> +	rds_trans_stop_listening();
> +	rds_conn_exit();
> +	rds_cong_exit();
> +	rds_sysctl_exit();
> +	rds_threads_exit();
> +	rds_stats_exit();
> +	rds_page_exit();
> +	rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info);
> +	rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
> +}
> +module_exit(rds_exit);
> +
> +static int __init rds_init(void)
> +{
> +	int ret;
> +
> +#ifndef KERNEL_HAS_NOT_DEFINED
> +	/* the strange ifdef above has scripts/makepatch.sh strip this out */
> +#if PF_RDS == 21
> +	printk(KERN_ERR "!!! This build of RDS is using PF 21 which is not "
> +	       "reserved\n");
> +	printk(KERN_ERR "!!! This is only suitable for testing, DO NOT "
> +	       "RELEASE THIS.\n");
> +	printk(KERN_ERR "!!! No, seriously.\n");
> +#endif
> +#endif /* KERNEL_HAS_NOT_DEFINED */

You don't want this ifdef crap in upstream

> +
> +	ret = rds_conn_init();
> +	if (ret)
> +		goto out;
> +	ret = rds_threads_init();
> +	if (ret)
> +		goto out_conn;
> +	ret = rds_sysctl_init();
> +	if (ret)
> +		goto out_threads;
> +	ret = rds_stats_init();
> +	if (ret)
> +		goto out_sysctl;
> +#ifdef KERNEL_HAS_PROTO_REGISTER
> +	ret = proto_register(&rds_proto, 1);
> +	if (ret)
> +		goto out_stats;
> +#endif /* KERNEL_HAS_PROTO_REGISTER */
> +	ret = sock_register(&rds_family_ops);
> +	if (ret)
> +		goto out_proto;
> +
> +	rds_info_register_func(RDS_INFO_SOCKETS, rds_sock_info);
> +	rds_info_register_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
> +
> +	ret = rds_ib_init();
> +	if (ret)
> +		goto out_sock;
> +	goto out;
> +
> +out_sock:
> +	sock_unregister(rds_family_ops.family);
> +out_proto:
> +#ifdef KERNEL_HAS_PROTO_REGISTER
> +	proto_unregister(&rds_proto);
> +out_stats:
> +#endif /* KERNEL_HAS_PROTO_REGISTER */
> +	rds_stats_exit();
> +out_sysctl:
> +	rds_sysctl_exit();
> +out_threads:
> +	rds_threads_exit();
> +out_conn:
> +	rds_conn_exit();
> +	rds_cong_exit();
> +	rds_page_exit();
> +out:
> +	return ret;
> +}
> +module_init(rds_init);
> +
> +#define DRV_VERSION     "4.0"
> +#define DRV_RELDATE     "July 28, 2008"
> +
> +MODULE_AUTHOR("Zach Brown");
> +MODULE_AUTHOR("Olaf Kirch");
> +MODULE_DESCRIPTION("RDS: Reliable Datagram Sockets"
> +		   " v" DRV_VERSION " (" DRV_RELDATE ")");
> +MODULE_VERSION(DRV_VERSION);
> +MODULE_LICENSE("Dual BSD/GPL");
> +MODULE_ALIAS_NETPROTO(PF_RDS);

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/21] RDS: Congestion-handling code
  2009-01-27  2:17 ` [PATCH 03/21] RDS: Congestion-handling code Andy Grover
@ 2009-01-27  3:48   ` Stephen Hemminger
  2009-01-27 19:15     ` Andrew Grover
  2009-01-27 13:10   ` Evgeniy Polyakov
  2009-01-28 22:57   ` Roland Dreier
  2 siblings, 1 reply; 64+ messages in thread
From: Stephen Hemminger @ 2009-01-27  3:48 UTC (permalink / raw)
  To: Andy Grover; +Cc: rdreier, rds-devel, general, netdev

On Mon, 26 Jan 2009 18:17:40 -0800
Andy Grover <andy.grover@oracle.com> wrote:

> RDS handles per-socket congestion by updating peers with a complete
> congestion map (8KB). This code keeps track of these maps for itself
> and ones received from peers.
> 
> Signed-off-by: Andy Grover <andy.grover@oracle.com>
> ---
>  drivers/infiniband/ulp/rds/cong.c |  424 +++++++++++++++++++++++++++++++++++++
>  1 files changed, 424 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/infiniband/ulp/rds/cong.c
> 
> diff --git a/drivers/infiniband/ulp/rds/cong.c b/drivers/infiniband/ulp/rds/cong.c
> new file mode 100644
> index 0000000..b7c49d2
> --- /dev/null
> +++ b/drivers/infiniband/ulp/rds/cong.c
> @@ -0,0 +1,424 @@
> +/*
> + * Copyright (c) 2007 Oracle.  All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +#include <linux/types.h>
> +#include <linux/rbtree.h>
> +
> +#include "rds.h"
> +
> +/*
> + * This file implements the receive side of the unconventional congestion
> + * management in RDS.
> + *
> + * Messages waiting in the receive queue on the receiving socket are accounted
> + * against the socket's SO_RCVBUF option value.  Only the payload bytes in the
> + * message are accounted for.  If the number of bytes queued equals or exceeds
> + * rcvbuf then the socket is congested.  All sends attempted to this socket's
> + * address should block or return -EWOULDBLOCK.
> + *
> + * Applications are expected to be reasonably tuned such that this situation
> + * very rarely occurs.  An application encountering this "back-pressure" is
> + * considered to have a bug.
> + *
> + * This is implemented by having each node maintain bitmaps which indicate
> + * which ports on bound addresses are congested.  As the bitmap changes it is
> + * sent through all the connections which terminate in the local address of the
> + * bitmap which changed.
> + *
> + * The bitmaps are allocated as connections are brought up.  This avoids
> + * allocation in the interrupt handling path which queues messages on sockets.
> + * The dense bitmaps let transports send the entire bitmap on any bitmap change
> + * reasonably efficiently.  This is much easier to implement than some
> + * finer-grained communication of per-port congestion.  The sender does a very
> + * inexpensive bit test to check whether the port it's about to send to is
> + * congested or not.
> + */
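
In sketch form, the check the sender does is a single bit test against
the 8KB map. A minimal illustration, with the constant and helper name
made up here rather than taken from the patch:

	#include <linux/bitops.h>
	#include <asm/byteorder.h>

	/* one bit per port: 65536 ports / 8 bits = 8KB per peer address */
	#define RDS_CONG_MAP_BYTES	(65536 / 8)

	/* cheap check done before queueing each datagram */
	static int rds_port_congested(const unsigned long *map, __be16 port)
	{
		return test_bit(be16_to_cpu(port), map);
	}
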
> +
> +/*
> + * Interaction with poll is a tad tricky. We want all processes stuck in
> + * poll to wake up and check whether a congested destination became uncongested.
> + * The really sad thing is we have no idea which destinations the application
> + * wants to send to - we don't even know which rds_connections are involved.
> + * So until we implement a more flexible rds poll interface, we have to make
> + * do with this:
> + * We maintain a global counter that is incremented each time a congestion map
> + * update is received. Each rds socket tracks this value, and if rds_poll
> + * finds that the saved generation number is smaller than the global generation
> + * number, it wakes up the process.
> + */
> +static atomic_t		rds_cong_generation = ATOMIC_INIT(0);
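
A minimal sketch of that generation scheme (names here only approximate
the patch's):

	/* bumped by the receive path on every congestion map update */
	static atomic_t cong_generation = ATOMIC_INIT(0);

	/* called from poll: has anything changed since *recent was saved? */
	static int cong_updated_since(unsigned long *recent)
	{
		unsigned long gen = (unsigned long)atomic_read(&cong_generation);

		if (*recent == gen)
			return 0;
		*recent = gen;	/* remember what we have seen */
		return 1;
	}
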
> +
> +/*
> + * Congestion monitoring
> + */
> +static LIST_HEAD(rds_cong_monitor);
> +static DEFINE_RWLOCK(rds_cong_monitor_lock);
> +
> +/*
> + * Yes, a global lock.  It's used so infrequently that it's worth keeping it
> + * global to simplify the locking.  It's only used in the following
> + * circumstances:
> + *
> + *  - on connection buildup to associate a conn with its maps
> + *  - on map changes to inform conns of a new map to send
> + *
> + *  It's sadly ordered under the socket callback lock and the connection lock.
> + *  Receive paths can mark ports congested from interrupt context so the
> + *  lock masks interrupts.
> + */

So this is starting to look like another "Oracle special" like AIO
and HugeTLB. Those come with lots of caveats and restrictions on the
application.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 01/21] RDS: Socket interface
  2009-01-27  2:17 ` [ofa-general] [PATCH 01/21] RDS: Socket interface Andy Grover
  2009-01-27  3:46   ` Stephen Hemminger
@ 2009-01-27  4:11   ` David Miller
  2009-01-29 20:22     ` [ofa-general] ***SPAM*** " Andrew Grover
  2009-01-27 12:08   ` Evgeniy Polyakov
  2 siblings, 1 reply; 64+ messages in thread
From: David Miller @ 2009-01-27  4:11 UTC (permalink / raw)
  To: andy.grover; +Cc: netdev, rdreier, rds-devel, general


Socket family implementations do not belong under the
infiniband subdirectory.

Put it under net/ instead.

I don't care what the interdependencies happen to be.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 21/21] RDS: Add AF and PF #defines for RDS sockets
  2009-01-27  2:17 ` [PATCH 21/21] RDS: Add AF and PF #defines for RDS sockets Andy Grover
@ 2009-01-27  7:27   ` Rémi Denis-Courmont
  2009-01-27 19:31     ` [ofa-general] " Andrew Grover
  0 siblings, 1 reply; 64+ messages in thread
From: Rémi Denis-Courmont @ 2009-01-27  7:27 UTC (permalink / raw)
  To: ext Andy Grover; +Cc: rdreier, rds-devel, general, netdev

On Tuesday 27 January 2009 04:17:58 ext Andy Grover, you wrote:
> RDS is a reliable datagram protocol used for IPC on Oracle
> database clusters. This adds address and protocol family numbers
> for it.
>
> Signed-off-by: Andy Grover <andy.grover@oracle.com>
> ---
>  include/linux/socket.h |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 20fc4bb..fda91af 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -191,7 +191,8 @@ struct ucred {
>  #define AF_RXRPC	33	/* RxRPC sockets 		*/
>  #define AF_ISDN		34	/* mISDN sockets 		*/
>  #define AF_PHONET	35	/* Phonet sockets		*/
> -#define AF_MAX		36	/* For now.. */
> +#define AF_RDS		36	/* RDS sockets 			*/
> +#define AF_MAX		37	/* For now.. */
>
>  /* Protocol families, same as address families. */
>  #define PF_UNSPEC	AF_UNSPEC
> @@ -229,6 +230,7 @@ struct ucred {
>  #define PF_RXRPC	AF_RXRPC
>  #define PF_ISDN		AF_ISDN
>  #define PF_PHONET	AF_PHONET
> +#define PF_RDS		AF_RDS
>  #define PF_MAX		AF_MAX
>
>  /* Maximum queue length specifiable by listen.  */

You also need to add lock class declaration to net/core/sock.c, I believe.
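
For readers unfamiliar with that: the lockdep name tables in
net/core/sock.c carry one string per address family, so AF_RDS needs
entries there. A sketch (designated initializers used for brevity; the
real tables list every family positionally, and there are matching
"slock-AF_*" and "clock-AF_*" tables):

	static const char *const af_family_key_strings[AF_MAX + 1] = {
		[AF_RDS] = "sk_lock-AF_RDS",
	};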

-- 
Rémi Denis-Courmont
Maemo Software, Nokia Devices R&D


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/21] RDS: Main header file
  2009-01-27  2:17 ` [ofa-general] [PATCH 02/21] RDS: Main header file Andy Grover
@ 2009-01-27  7:34   ` Rémi Denis-Courmont
  2009-01-27 19:27     ` [ofa-general] " Andrew Grover
  2009-01-27 13:05   ` Evgeniy Polyakov
  1 sibling, 1 reply; 64+ messages in thread
From: Rémi Denis-Courmont @ 2009-01-27  7:34 UTC (permalink / raw)
  To: ext Andy Grover; +Cc: rdreier, rds-devel, general, netdev

On Tuesday 27 January 2009 04:17:39 ext Andy Grover, you wrote:
> +/*
> + * XXX randomly chosen, but at least seems to be unused:
> + * #               18464-18768 Unassigned
> + * We should do better.  We want a reserved port to discourage unpriv'ed
> + * userspace from listening.
> + */
> +#define RDS_PORT	18634

Internet transport protocol port number? IANA has a process for assigning port 
numbers to proprietary protocols.

Not that I'd blame you, as I inherited VLC media player's wide abuse of port 
1234 as its current network core maintainer :(

> +#ifndef AF_RDS
> +#define AF_RDS          28      /* Reliable Datagram Socket     */
> +#endif
> +
> +#ifndef PF_RDS
> +#define PF_RDS          AF_RDS
> +#endif

You should probably remove that and put the last patch of your series ahead of 
this one.

> +#ifndef SOL_RDS
> +#define SOL_RDS         272
> +#endif

This is used by RXRPC nowadays, although I myself don't really understand why 
socket option levels need to be unique across all families.

-- 
Rémi Denis-Courmont
Maemo Software, Nokia Devices R&D


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 01/21] RDS: Socket interface
  2009-01-27  2:17 ` [ofa-general] [PATCH 01/21] RDS: Socket interface Andy Grover
  2009-01-27  3:46   ` Stephen Hemminger
  2009-01-27  4:11   ` David Miller
@ 2009-01-27 12:08   ` Evgeniy Polyakov
  2009-01-29  4:02     ` [ofa-general] " Andrew Grover
  2 siblings, 1 reply; 64+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 12:08 UTC (permalink / raw)
  To: Andy Grover; +Cc: rdreier, rds-devel, general, netdev

Hi Andy.

On Mon, Jan 26, 2009 at 06:17:38PM -0800, Andy Grover (andy.grover@oracle.com) wrote:
> +/* this is just used for stats gathering :/ */

Shouldn't this be some kind of per-cpu data?

> +static DEFINE_SPINLOCK(rds_sock_lock);
> +static unsigned long rds_sock_count;
> +static LIST_HEAD(rds_sock_list);
> +DECLARE_WAIT_QUEUE_HEAD(rds_poll_waitq);

Global list of all sockets? This does not scale; maybe it should be
grouped into a hash table or made per-device?

> +static int rds_release(struct socket *sock)
> +{
> +	struct sock *sk = sock->sk;
> +	struct rds_sock *rs;
> +	unsigned long flags;
> +
> +	if (sk == NULL)
> +		goto out;
> +
> +	rs = rds_sk_to_rs(sk);
> +
> +	sock_orphan(sk);

Why is this needed when the socket is about to be freed?

> +	/* Note - rds_clear_recv_queue grabs rs_recv_lock, so
> +	 * that ensures the recv path has completed messing
> +	 * with the socket. */
> +	rds_clear_recv_queue(rs);
> +	rds_cong_remove_socket(rs);
> +	rds_remove_bound(rs);
> +	rds_send_drop_to(rs, NULL);
> +	rds_rdma_drop_keys(rs);
> +	rds_notify_queue_get(rs, NULL);
> +
> +	spin_lock_irqsave(&rds_sock_lock, flags);
> +	list_del_init(&rs->rs_item);
> +	rds_sock_count--;
> +	spin_unlock_irqrestore(&rds_sock_lock, flags);

Do RDS sockets work well under workloads with a high rate of socket
creation and destruction?
> +static unsigned int rds_poll(struct file *file, struct socket *sock,
> +			     poll_table *wait)
> +{
> +	struct sock *sk = sock->sk;
> +	struct rds_sock *rs = rds_sk_to_rs(sk);
> +	unsigned int mask = 0;
> +	unsigned long flags;
> +
> +	poll_wait(file, sk->sk_sleep, wait);
> +
> +	poll_wait(file, &rds_poll_waitq, wait);
> +

Are you absolutely sure that the provided poll_table callback
will not do the bad things here? It is quite unusual to add several
different queues into the same head in the poll callback.
And shouldn't rds_poll_waitq be lock protected here?

> +	read_lock_irqsave(&rs->rs_recv_lock, flags);
> +	if (!rs->rs_cong_monitor) {
> +		/* When a congestion map was updated, we signal POLLIN for
> +		 * "historical" reasons. Applications can also poll for
> +		 * WRBAND instead. */
> +		if (rds_cong_updated_since(&rs->rs_cong_track))
> +			mask |= (POLLIN | POLLRDNORM | POLLWRBAND);
> +	} else {
> +		spin_lock(&rs->rs_lock);

Is there a possibility of a lock interaction problem with the above
rs_recv_lock read lock?

> +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)

This should be dropped in the mainline tree.

> +/*
> + * XXX this probably still needs more work.. no INADDR_ANY, and rbtrees aren't
> + * particularly zippy.
> + *
> + * This is now called for every incoming frame so we arguably care much more
> + * about it than we used to.
> + */
> +static DEFINE_SPINLOCK(rds_bind_lock);
> +static struct rb_root rds_bind_tree = RB_ROOT;

A hash table of appropriate size would have faster lookup/access
times, btw.
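
A sketch of that alternative, with the bucket count and helper names
made up here:

	#include <linux/jhash.h>
	#include <linux/hash.h>
	#include <linux/list.h>

	#define RDS_BIND_HASH_BITS	10

	static struct hlist_head rds_bind_hash[1 << RDS_BIND_HASH_BITS];

	/* bucket for a bound (address, port) pair */
	static struct hlist_head *rds_bind_bucket(__be32 addr, __be16 port)
	{
		u32 h = jhash_2words((__force u32)addr, (__force u32)port, 0);

		return &rds_bind_hash[hash_32(h, RDS_BIND_HASH_BITS)];
	}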

> +static struct rds_sock *rds_bind_tree_walk(__be32 addr, __be16 port,
> +					   struct rds_sock *insert)
> +{
> +	struct rb_node **p = &rds_bind_tree.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct rds_sock *rs;
> +	u64 cmp;
> +	u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
> +
> +	while (*p) {
> +		parent = *p;
> +		rs = rb_entry(parent, struct rds_sock, rs_bound_node);
> +
> +		cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
> +		      be16_to_cpu(rs->rs_bound_port);
> +
> +		if (needle < cmp)

Should it use wrapping logic if some field overflows?

> +	rdsdebug("returning rs %p for %u.%u.%u.%u:%u\n", rs, NIPQUAD(addr),
> +		ntohs(port));

Iirc there is a new %pi4 or similar format id.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/21] RDS: Main header file
  2009-01-27  2:17 ` [ofa-general] [PATCH 02/21] RDS: Main header file Andy Grover
  2009-01-27  7:34   ` Rémi Denis-Courmont
@ 2009-01-27 13:05   ` Evgeniy Polyakov
  2009-01-27 19:23     ` [ofa-general] ***SPAM*** " Andrew Grover
  1 sibling, 1 reply; 64+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 13:05 UTC (permalink / raw)
  To: Andy Grover; +Cc: rdreier, rds-devel, general, netdev

Hi.

On Mon, Jan 26, 2009 at 06:17:39PM -0800, Andy Grover (andy.grover@oracle.com) wrote:
> +/*
> + * XXX randomly chosen, but at least seems to be unused:
> + * #               18464-18768 Unassigned
> + * We should do better.  We want a reserved port to discourage unpriv'ed
> + * userspace from listening.
> + */
> +#define RDS_PORT	18634
> +

What will happen if some application already uses that port?

> +#ifndef AF_RDS
> +#define AF_RDS          28      /* Reliable Datagram Socket     */
> +#endif

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/21] RDS: Congestion-handling code
  2009-01-27  2:17 ` [PATCH 03/21] RDS: Congestion-handling code Andy Grover
  2009-01-27  3:48   ` Stephen Hemminger
@ 2009-01-27 13:10   ` Evgeniy Polyakov
  2009-01-27 19:10     ` Andrew Grover
  2009-01-28 22:57   ` Roland Dreier
  2 siblings, 1 reply; 64+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 13:10 UTC (permalink / raw)
  To: Andy Grover; +Cc: rdreier, rds-devel, general, netdev

On Mon, Jan 26, 2009 at 06:17:40PM -0800, Andy Grover (andy.grover@oracle.com) wrote:
> +/*
> + * Yes, a global lock.  It's used so infrequently that it's worth keeping it
> + * global to simplify the locking.  It's only used in the following
> + * circumstances:
> + *
> + *  - on connection buildup to associate a conn with its maps

Is this a rare condition? Is this protocol only intended for
long-lived connections, and not suitable for cases where lots of
them are created and torn down quickly?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/21] RDS: Transport code
  2009-01-27  2:17 ` [PATCH 04/21] RDS: Transport code Andy Grover
@ 2009-01-27 13:18   ` Evgeniy Polyakov
  2009-01-27 19:36     ` Andrew Grover
  0 siblings, 1 reply; 64+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 13:18 UTC (permalink / raw)
  To: Andy Grover; +Cc: rdreier, rds-devel, general, netdev

On Mon, Jan 26, 2009 at 06:17:41PM -0800, Andy Grover (andy.grover@oracle.com) wrote:
> +static LIST_HEAD(transports);
> +static DECLARE_RWSEM(trans_sem);
> +

RDS_ prefix?

> +int rds_trans_register(struct rds_transport *trans)
> +{
> +	BUG_ON(strlen(trans->t_name) + 1 >
> +	       sizeof(((struct rds_info_connection *)0)->transport));
> +

Wow. Why not declare the 15 as a named constant and put it into the
rds_transport structure definition?

> +struct rds_transport *rds_trans_get_preferred(__be32 addr)
> +{
> +	struct rds_transport *trans;
> +	struct rds_transport *ret = NULL;
> +
> +        if (IN_LOOPBACK(ntohl(addr)))
> +                return &rds_loop_transport;
> +

Tabs have run away.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 05/21] RDS: Info and stats
  2009-01-27  2:17 ` [ofa-general] [PATCH 05/21] RDS: Info and stats Andy Grover
@ 2009-01-27 13:28   ` Evgeniy Polyakov
  0 siblings, 0 replies; 64+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 13:28 UTC (permalink / raw)
  To: Andy Grover; +Cc: rdreier, rds-devel, general, netdev

On Mon, Jan 26, 2009 at 06:17:42PM -0800, Andy Grover (andy.grover@oracle.com) wrote:
> +void rds_info_register_func(int optname, rds_info_func func)
> +{
> +	int offset = optname - RDS_INFO_FIRST;
> +
> +	BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST);
> +
> +	spin_lock(&rds_info_lock);
> +	BUG_ON(rds_info_funcs[offset] != NULL);
> +	rds_info_funcs[offset] = func;
> +	spin_unlock(&rds_info_lock);
> +}
> +EXPORT_SYMBOL_GPL(rds_info_register_func);
> +
> +void rds_info_deregister_func(int optname, rds_info_func func)
> +{
> +	int offset = optname - RDS_INFO_FIRST;
> +
> +	BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST);
> +

Those bug_ons look quite scary; is there a way to actually get a wrong
optname? Plus, those _INFO definitions are declared twice in the code,
which makes them harder to update.

> +/*
> + * Typically we hold an atomic kmap across multiple rds_info_copy() calls
> + * because the kmap is so expensive.  This must be called before using blocking
> + * operations while holding the mapping and as the iterator is torn down.
> + */
> +void rds_info_iter_unmap(struct rds_info_iterator *iter)
> +{
> +	if (iter->addr != NULL) {
> +		kunmap_atomic(iter->addr, KM_USER0);
> +		iter->addr = NULL;
> +	}
> +}
> +

This one is used to temporarily map some address, but functions called
between the map and unmap (like rds_info_getsockopt()) may sleep, which
is wrong while an atomic kmap is held.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/21] RDS: Connection handling
  2009-01-27  2:17 ` [PATCH 06/21] RDS: Connection handling Andy Grover
@ 2009-01-27 13:34   ` Evgeniy Polyakov
  2009-01-27 13:47     ` Oliver Neukum
  0 siblings, 1 reply; 64+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 13:34 UTC (permalink / raw)
  To: Andy Grover; +Cc: rdreier, rds-devel, general, netdev

On Mon, Jan 26, 2009 at 06:17:43PM -0800, Andy Grover (andy.grover@oracle.com) wrote:
> +static inline int rds_conn_is_sending(struct rds_connection *conn)
> +{
> +	int ret = 0;
> +
> +	if (!mutex_trylock(&conn->c_send_lock))
> +		ret = 1;
> +	else
> +		mutex_unlock(&conn->c_send_lock);
> +
> +	return ret;
> +}
> +

This one is eventually invoked under a spin_lock with IRQs turned off,
which may freeze the machine:
rds_for_each_conn_info() -> spin_lock_irqsave(global lock) ->
rds_conn_info_visitor() -> rds_conn_info_set() -> rds_conn_is_sending()
-> boom.

I did not check further though.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/21] RDS: Connection handling
  2009-01-27 13:34   ` Evgeniy Polyakov
@ 2009-01-27 13:47     ` Oliver Neukum
  2009-01-27 13:51       ` Evgeniy Polyakov
  2009-01-27 16:28       ` [ofa-general] " Steve Wise
  0 siblings, 2 replies; 64+ messages in thread
From: Oliver Neukum @ 2009-01-27 13:47 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Andy Grover, rdreier, rds-devel, general, netdev

Am Tuesday 27 January 2009 14:34:19 schrieb Evgeniy Polyakov:
> On Mon, Jan 26, 2009 at 06:17:43PM -0800, Andy Grover (andy.grover@oracle.com) wrote:
> > +static inline int rds_conn_is_sending(struct rds_connection *conn)
> > +{
> > +	int ret = 0;
> > +
> > +	if (!mutex_trylock(&conn->c_send_lock))
> > +		ret = 1;
> > +	else
> > +		mutex_unlock(&conn->c_send_lock);
> > +
> > +	return ret;
> > +}
> > +
> 
> This one is eventually invoked under a spin_lock with IRQs turned off,
> which may freeze the machine:
> rds_for_each_conn_info() -> spin_lock_irqsave(global lock) ->
> rds_conn_info_visitor() -> rds_conn_info_set() -> rds_conn_is_sending()
> -> boom.

Why? This is _trylock. It won't block.

	Regards
		Oliver
 



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/21] RDS: Connection handling
  2009-01-27 13:47     ` Oliver Neukum
@ 2009-01-27 13:51       ` Evgeniy Polyakov
  2009-01-27 16:28       ` [ofa-general] " Steve Wise
  1 sibling, 0 replies; 64+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 13:51 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Andy Grover, rdreier, rds-devel, general, netdev

On Tue, Jan 27, 2009 at 02:47:27PM +0100, Oliver Neukum (oliver@neukum.org) wrote:
> > > +static inline int rds_conn_is_sending(struct rds_connection *conn)
> > > +{
> > > +	int ret = 0;
> > > +
> > > +	if (!mutex_trylock(&conn->c_send_lock))
> > > +		ret = 1;
> > > +	else
> > > +		mutex_unlock(&conn->c_send_lock);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > 
> > This one is eventually invoked under a spin_lock with IRQs turned off,
> > which may freeze the machine:
> > rds_for_each_conn_info() -> spin_lock_irqsave(global lock) ->
> > rds_conn_info_visitor() -> rds_conn_info_set() -> rds_conn_is_sending()
> > -> boom.
> 
> Why? This is _trylock. It won't block.

Unlock may reschedule.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS)
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (20 preceding siblings ...)
  2009-01-27  2:17 ` [PATCH 21/21] RDS: Add AF and PF #defines for RDS sockets Andy Grover
@ 2009-01-27 15:34 ` Steve Wise
  2009-01-27 19:29   ` ***SPAM*** " Andrew Grover
  2009-01-28 22:37 ` Roland Dreier
  22 siblings, 1 reply; 64+ messages in thread
From: Steve Wise @ 2009-01-27 15:34 UTC (permalink / raw)
  To: Andy Grover; +Cc: netdev, rdreier, rds-devel, general

Hey Andy,

Why didn't you include the iWARP transport as well?


Andy Grover wrote:
> Hi Roland,
>
> This patchset adds support for RDS as an Infiniband ULP. RDS is an
> Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably,
> and is used currently in Oracle RAC and Exadata products. It's lived
> in OFED for 2+ years and I think it's time to get it upstream -- most
> likely into your -next tree for .30, but if it snuck into .29 via the
> "new code merge-window exception" then even better.
>
> I've run checkpatch & sparse to clean up as many issues as possible so
> what remains are really the design peculiarities (aka warts) that arise
> from being a protocol designed by one company for a single critical
> application. I think upstreaming this code is the first step towards
> working out those issues, and making the end result available to a wider
> audience.
>
> Also available for review at:
> git://git.openfabrics.org/~agrover/ofed_1_4/linux-2.6 for-roland
>
> Thoughts? shortlog follows.
>
> Thanks -- Regards -- Andy
>
> Andy Grover (21):
>       RDS: Socket interface
>       RDS: Main header file
>       RDS: Congestion-handling code
>       RDS: Transport code
>       RDS: Info and stats
>       RDS: Connection handling
>       RDS: loopback
>       RDS: sysctls
>       RDS: Message parsing
>       RDS: send.c
>       RDS: recv.c
>       RDS: RDMA support
>       RDS/IB: Infiniband transport
>       RDS/IB: Ring-handling code.
>       RDS/IB: Implement RDMA ops using FMRs
>       RDS/IB: Implement IB-specific datagram send.
>       RDS/IB: Receive datagrams via IB
>       RDS/IB: Stats and sysctls
>       RDS: Documentation
>       RDS: Kconfig and Makefile
>       RDS: Add AF and PF #defines for RDS sockets
>
>  Documentation/networking/rds.txt        |  356 +++++++++++
>  drivers/infiniband/Kconfig              |    2 +
>  drivers/infiniband/Makefile             |    1 +
>  drivers/infiniband/ulp/rds/Kconfig      |   13 +
>  drivers/infiniband/ulp/rds/Makefile     |   13 +
>  drivers/infiniband/ulp/rds/af_rds.c     |  677 +++++++++++++++++++++
>  drivers/infiniband/ulp/rds/bind.c       |  202 +++++++
>  drivers/infiniband/ulp/rds/cong.c       |  424 +++++++++++++
>  drivers/infiniband/ulp/rds/connection.c |  501 +++++++++++++++
>  drivers/infiniband/ulp/rds/ib.c         |  312 ++++++++++
>  drivers/infiniband/ulp/rds/ib.h         |  358 +++++++++++
>  drivers/infiniband/ulp/rds/ib_cm.c      |  882
> +++++++++++++++++++++++++++
>  drivers/infiniband/ulp/rds/ib_rdma.c    |  641 ++++++++++++++++++++
>  drivers/infiniband/ulp/rds/ib_recv.c    |  894
> +++++++++++++++++++++++++++
>  drivers/infiniband/ulp/rds/ib_ring.c    |  168 +++++
>  drivers/infiniband/ulp/rds/ib_send.c    |  852
> ++++++++++++++++++++++++++
>  drivers/infiniband/ulp/rds/ib_stats.c   |   95 +++
>  drivers/infiniband/ulp/rds/ib_sysctl.c  |  137 +++++
>  drivers/infiniband/ulp/rds/info.c       |  243 ++++++++
>  drivers/infiniband/ulp/rds/info.h       |   43 ++
>  drivers/infiniband/ulp/rds/loop.c       |  189 ++++++
>  drivers/infiniband/ulp/rds/loop.h       |    9 +
>  drivers/infiniband/ulp/rds/message.c    |  414 +++++++++++++
>  drivers/infiniband/ulp/rds/page.c       |  222 +++++++
>  drivers/infiniband/ulp/rds/rdma.c       |  682 +++++++++++++++++++++
>  drivers/infiniband/ulp/rds/rdma.h       |   84 +++
>  drivers/infiniband/ulp/rds/rds.h        |  763 +++++++++++++++++++++++
>  drivers/infiniband/ulp/rds/rds_rdma.h   |  245 ++++++++
>  drivers/infiniband/ulp/rds/recv.c       |  550 +++++++++++++++++
>  drivers/infiniband/ulp/rds/send.c       | 1006
> +++++++++++++++++++++++++++++++
>  drivers/infiniband/ulp/rds/stats.c      |  150 +++++
>  drivers/infiniband/ulp/rds/sysctl.c     |  164 +++++
>  drivers/infiniband/ulp/rds/threads.c    |  273 +++++++++
>  drivers/infiniband/ulp/rds/transport.c  |  134 ++++
>  include/linux/socket.h                  |    4 +-
>  35 files changed, 11702 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/networking/rds.txt
>  create mode 100644 drivers/infiniband/ulp/rds/Kconfig
>  create mode 100644 drivers/infiniband/ulp/rds/Makefile
>  create mode 100644 drivers/infiniband/ulp/rds/af_rds.c
>  create mode 100644 drivers/infiniband/ulp/rds/bind.c
>  create mode 100644 drivers/infiniband/ulp/rds/cong.c
>  create mode 100644 drivers/infiniband/ulp/rds/connection.c
>  create mode 100644 drivers/infiniband/ulp/rds/ib.c
>  create mode 100644 drivers/infiniband/ulp/rds/ib.h
>  create mode 100644 drivers/infiniband/ulp/rds/ib_cm.c
>  create mode 100644 drivers/infiniband/ulp/rds/ib_rdma.c
>  create mode 100644 drivers/infiniband/ulp/rds/ib_recv.c
>  create mode 100644 drivers/infiniband/ulp/rds/ib_ring.c
>  create mode 100644 drivers/infiniband/ulp/rds/ib_send.c
>  create mode 100644 drivers/infiniband/ulp/rds/ib_stats.c
>  create mode 100644 drivers/infiniband/ulp/rds/ib_sysctl.c
>  create mode 100644 drivers/infiniband/ulp/rds/info.c
>  create mode 100644 drivers/infiniband/ulp/rds/info.h
>  create mode 100644 drivers/infiniband/ulp/rds/loop.c
>  create mode 100644 drivers/infiniband/ulp/rds/loop.h
>  create mode 100644 drivers/infiniband/ulp/rds/message.c
>  create mode 100644 drivers/infiniband/ulp/rds/page.c
>  create mode 100644 drivers/infiniband/ulp/rds/rdma.c
>  create mode 100644 drivers/infiniband/ulp/rds/rdma.h
>  create mode 100644 drivers/infiniband/ulp/rds/rds.h
>  create mode 100644 drivers/infiniband/ulp/rds/rds_rdma.h
>  create mode 100644 drivers/infiniband/ulp/rds/recv.c
>  create mode 100644 drivers/infiniband/ulp/rds/send.c
>  create mode 100644 drivers/infiniband/ulp/rds/stats.c
>  create mode 100644 drivers/infiniband/ulp/rds/sysctl.c
>  create mode 100644 drivers/infiniband/ulp/rds/threads.c
>  create mode 100644 drivers/infiniband/ulp/rds/transport.c
>
> end

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [ofa-general] Re: [PATCH 06/21] RDS: Connection handling
  2009-01-27 13:47     ` Oliver Neukum
  2009-01-27 13:51       ` Evgeniy Polyakov
@ 2009-01-27 16:28       ` Steve Wise
  2009-01-29  3:03         ` ***SPAM*** " Andrew Grover
  1 sibling, 1 reply; 64+ messages in thread
From: Steve Wise @ 2009-01-27 16:28 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Evgeniy Polyakov, rdreier, rds-devel, general, netdev

Oliver Neukum wrote:
> Am Tuesday 27 January 2009 14:34:19 schrieb Evgeniy Polyakov:
>   
>> On Mon, Jan 26, 2009 at 06:17:43PM -0800, Andy Grover (andy.grover@oracle.com) wrote:
>>     
>>> +static inline int rds_conn_is_sending(struct rds_connection *conn)
>>> +{
>>> +	int ret = 0;
>>> +
>>> +	if (!mutex_trylock(&conn->c_send_lock))
>>> +		ret = 1;
>>> +	else
>>> +		mutex_unlock(&conn->c_send_lock);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>>       
>> This one is eventually invoked under a spin_lock with IRQs turned off,
>> which may freeze the machine:
>> rds_for_each_conn_info() -> spin_lock_irqsave(global lock) ->
>> rds_conn_info_visitor() -> rds_conn_info_set() -> rds_conn_is_sending()
>> -> boom.
>>     
>
> Why? This is _trylock. It won't block.
>
>   
mutex_trylock() uses spin_lock_mutex() which has this in the debug version:

DEBUG_LOCKS_WARN_ON(in_interrupt());

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/21] RDS: Congestion-handling code
  2009-01-27 13:10   ` Evgeniy Polyakov
@ 2009-01-27 19:10     ` Andrew Grover
  0 siblings, 0 replies; 64+ messages in thread
From: Andrew Grover @ 2009-01-27 19:10 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Andy Grover, rdreier, rds-devel, general, netdev

On Tue, Jan 27, 2009 at 5:10 AM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> On Mon, Jan 26, 2009 at 06:17:40PM -0800, Andy Grover (andy.grover@oracle.com) wrote:
>> +/*
>> + * Yes, a global lock.  It's used so infrequently that it's worth keeping it
>> + * global to simplify the locking.  It's only used in the following
>> + * circumstances:
>> + *
>> + *  - on connection buildup to associate a conn with its maps
>
> Is this a rare condition? Is this protocol only intended for
> long-lived connections, and not suitable for cases where lots of
> them are created and torn down quickly?

Connections are long-lived. Imagine a cluster. RDS multiplexes all
sockets' datagrams between 2 hosts over a single transport-layer
connection, so if a node sends ONE datagram to another, an IB
connection is set up and sticks around indefinitely.
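
In sketch form (helper names assumed here, not necessarily the patch's
exact API), every send resolves the address pair to that shared
connection, creating it on the first datagram:

	static struct rds_connection *conn_for(__be32 laddr, __be32 faddr)
	{
		struct rds_connection *conn;

		conn = rds_conn_lookup(laddr, faddr);	/* already have one? */
		if (conn == NULL)			/* first datagram: */
			conn = rds_conn_create(laddr, faddr);	/* set it up */
		return conn;
	}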

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/21] RDS: Congestion-handling code
  2009-01-27  3:48   ` Stephen Hemminger
@ 2009-01-27 19:15     ` Andrew Grover
  0 siblings, 0 replies; 64+ messages in thread
From: Andrew Grover @ 2009-01-27 19:15 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Andy Grover, rdreier, rds-devel, general, netdev

On Mon, Jan 26, 2009 at 7:48 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> So this is starting to look like another "Oracle special" like AIO
> and HugeTLB. Those come with lots of caveats and restrictions on the
> application.

Yep it's a datacenter-centric protocol.

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] ***SPAM*** Re: [PATCH 02/21] RDS: Main header file
  2009-01-27 13:05   ` Evgeniy Polyakov
@ 2009-01-27 19:23     ` Andrew Grover
  2009-01-27 19:24       ` Steve Wise
  0 siblings, 1 reply; 64+ messages in thread
From: Andrew Grover @ 2009-01-27 19:23 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: netdev, rdreier, rds-devel, general

On Tue, Jan 27, 2009 at 5:05 AM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
>> +#define RDS_PORT     18634
>> +
>
> What will happen if some application already uses that port?

RDS errors out.

Yeah we're going to want to get an assigned port at some point, I guess.

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [ofa-general] ***SPAM*** Re: [PATCH 02/21] RDS: Main header file
  2009-01-27 19:23     ` [ofa-general] ***SPAM*** " Andrew Grover
@ 2009-01-27 19:24       ` Steve Wise
  0 siblings, 0 replies; 64+ messages in thread
From: Steve Wise @ 2009-01-27 19:24 UTC (permalink / raw)
  To: Andrew Grover; +Cc: Evgeniy Polyakov, rdreier, rds-devel, general, netdev

Andrew Grover wrote:
> On Tue, Jan 27, 2009 at 5:05 AM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
>   
>>> +#define RDS_PORT     18634
>>> +
>>>       
>> What will happen if some application already uses that port?
>>     
>
> RDS errors out.
>
> Yeah we're going to want to get an assigned port at some point, I guess.
>
>   

You should start that process now.  It took a while to get nfsrdma's 
port number through...

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 02/21] RDS: Main header file
  2009-01-27  7:34   ` Rémi Denis-Courmont
@ 2009-01-27 19:27     ` Andrew Grover
  0 siblings, 0 replies; 64+ messages in thread
From: Andrew Grover @ 2009-01-27 19:27 UTC (permalink / raw)
  To: Rémi Denis-Courmont; +Cc: netdev, rdreier, rds-devel, general

On Mon, Jan 26, 2009 at 11:34 PM, Rémi Denis-Courmont
<remi.denis-courmont@nokia.com> wrote:

>> +#ifndef PF_RDS
>> +#define PF_RDS          AF_RDS
>> +#endif
>
> You should probably remove that and put the last patch of your series ahead of
> this one.

Yup will do.

>> +#ifndef SOL_RDS
>> +#define SOL_RDS         272
>> +#endif
>
> This is used by RXRPC nowadays, although I myself don't really understand why
> socket option levels need to be unique across all families.

OK, shouldn't be too hard to fix.

Thanks.

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* ***SPAM*** Re: [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS)
  2009-01-27 15:34 ` [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Steve Wise
@ 2009-01-27 19:29   ` Andrew Grover
  0 siblings, 0 replies; 64+ messages in thread
From: Andrew Grover @ 2009-01-27 19:29 UTC (permalink / raw)
  To: Steve Wise; +Cc: netdev, rdreier, rds-devel, general

On Tue, Jan 27, 2009 at 7:34 AM, Steve Wise <swise@opengridcomputing.com> wrote:
> Hey Andy,
>
> Why didn't you include the iWARP transport as well?

Hi Steve,

As I mentioned on IRC, there are some ib/iw coexistence issues and
other minor bugs to resolve, and then I will include the iWARP code.

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 21/21] RDS: Add AF and PF #defines for RDS sockets
  2009-01-27  7:27   ` Rémi Denis-Courmont
@ 2009-01-27 19:31     ` Andrew Grover
  0 siblings, 0 replies; 64+ messages in thread
From: Andrew Grover @ 2009-01-27 19:31 UTC (permalink / raw)
  To: Rémi Denis-Courmont; +Cc: netdev, rdreier, rds-devel, general

On Mon, Jan 26, 2009 at 11:27 PM, Rémi Denis-Courmont
<remi.denis-courmont@nokia.com> wrote:
> You also need to add lock class declaration to net/core/sock.c, I believe.

Very true, thanks.

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/21] RDS: Transport code
  2009-01-27 13:18   ` Evgeniy Polyakov
@ 2009-01-27 19:36     ` Andrew Grover
  2009-01-27 21:56       ` [ofa-general] " Evgeniy Polyakov
  0 siblings, 1 reply; 64+ messages in thread
From: Andrew Grover @ 2009-01-27 19:36 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Andy Grover, rdreier, rds-devel, general, netdev

On Tue, Jan 27, 2009 at 5:18 AM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> On Mon, Jan 26, 2009 at 06:17:41PM -0800, Andy Grover (andy.grover@oracle.com) wrote:
>> +static LIST_HEAD(transports);
>> +static DECLARE_RWSEM(trans_sem);
>> +
>
> RDS_ prefix?

Even needed for statics?

>> +int rds_trans_register(struct rds_transport *trans)
>> +{
>> +     BUG_ON(strlen(trans->t_name) + 1 >
>> +            sizeof(((struct rds_info_connection *)0)->transport));
>> +
>
> Wow. Why not declare the 15 as a named constant and put it into the
> rds_transport structure definition?

Makes sense.
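
A sketch of one way to do that (the constant name is made up here):

	#define RDS_TRANS_NAME_MAX	16

	struct rds_transport {
		char	t_name[RDS_TRANS_NAME_MAX];
		/* ... ops and list linkage ... */
	};

with struct rds_info_connection's transport[] sized by the same
constant, so the two cannot drift apart and the BUG_ON becomes
unnecessary.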

>> +        if (IN_LOOPBACK(ntohl(addr)))
>> +                return &rds_loop_transport;
>> +
>
> Tabs have run away.

Will fix. Thanks.

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 04/21] RDS: Transport code
  2009-01-27 19:36     ` Andrew Grover
@ 2009-01-27 21:56       ` Evgeniy Polyakov
  2009-01-27 22:15         ` [ofa-general] ***SPAM*** " Andrew Grover
  0 siblings, 1 reply; 64+ messages in thread
From: Evgeniy Polyakov @ 2009-01-27 21:56 UTC (permalink / raw)
  To: Andrew Grover; +Cc: netdev, rdreier, rds-devel, general

On Tue, Jan 27, 2009 at 11:36:37AM -0800, Andrew Grover (andy.grover@gmail.com) wrote:
> On Tue, Jan 27, 2009 at 5:18 AM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> > On Mon, Jan 26, 2009 at 06:17:41PM -0800, Andy Grover (andy.grover@oracle.com) wrote:
> >> +static LIST_HEAD(transports);
> >> +static DECLARE_RWSEM(trans_sem);
> >> +
> >
> > RDS_ prefix?
> 
> Even needed for statics?

It confuses tags and the like otherwise, and looks more consistent with
the rest of the code. Likely it is not a must, it just looks better.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] ***SPAM*** Re: [PATCH 04/21] RDS: Transport code
  2009-01-27 21:56       ` [ofa-general] " Evgeniy Polyakov
@ 2009-01-27 22:15         ` Andrew Grover
  0 siblings, 0 replies; 64+ messages in thread
From: Andrew Grover @ 2009-01-27 22:15 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: netdev, rdreier, rds-devel, general

On Tue, Jan 27, 2009 at 1:56 PM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
>> > RDS_ prefix?
>>
>> Even needed for statics?
>
> It confuses tags and the like otherwise, and looks more consistent with
> the rest of the code. Likely it is not a must, it just looks better.

Yup, will do, just was curious.

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/21] Reliable Datagram Sockets (RDS)
  2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
                   ` (21 preceding siblings ...)
  2009-01-27 15:34 ` [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Steve Wise
@ 2009-01-28 22:37 ` Roland Dreier
  2009-01-29  1:29   ` [ofa-general] " Andy Grover
  22 siblings, 1 reply; 64+ messages in thread
From: Roland Dreier @ 2009-01-28 22:37 UTC (permalink / raw)
  To: Andy Grover; +Cc: rds-devel, general, netdev

 > This patchset adds support for RDS as an Infiniband ULP. RDS is an
 > Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably,
 > and is used currently in Oracle RAC and Exadata products. It's lived
 > in OFED for 2+ years and I think it's time to get it upstream -- most
 > likely into your -next tree for .30, but if it snuck into .29 via the
 > "new code merge-window exception" then even better.

I'll read this over and comment, but to be honest I agree with Dave:
this is a new socket family, and as such it belongs under net/ and
probably should go through Dave's tree (just as the NFS/RDMA changes
went through the NFS trees and the 9p/RDMA changes went through the 9p
tree); even though it heavily uses RDMA I think the upper layer
interface to sockets/networking is more relevant.

 - R.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/21] RDS: Congestion-handling code
  2009-01-27  2:17 ` [PATCH 03/21] RDS: Congestion-handling code Andy Grover
  2009-01-27  3:48   ` Stephen Hemminger
  2009-01-27 13:10   ` Evgeniy Polyakov
@ 2009-01-28 22:57   ` Roland Dreier
  2009-01-29  2:39     ` [ofa-general] " Andy Grover
  2 siblings, 1 reply; 64+ messages in thread
From: Roland Dreier @ 2009-01-28 22:57 UTC (permalink / raw)
  To: Andy Grover; +Cc: rds-devel, general, netdev

 > +EXPORT_SYMBOL_GPL(rds_cong_map_updated);

What is this being exported to?  AFAICT you are only building a single
RDS module, right?

 - R.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 20/21] RDS: Kconfig and Makefile
  2009-01-27  2:17 ` [PATCH 20/21] RDS: Kconfig and Makefile Andy Grover
@ 2009-01-28 22:59   ` Roland Dreier
  2009-01-29  2:19     ` [ofa-general] " Andy Grover
  0 siblings, 1 reply; 64+ messages in thread
From: Roland Dreier @ 2009-01-28 22:59 UTC (permalink / raw)
  To: Andy Grover; +Cc: rds-devel, general, netdev

 > +obj-$(CONFIG_INFINIBAND_ISER)		+= ulp/rds/

Typo for ..._RDS

 > +config INFINIBAND_RDS_DEBUG
 > +        bool "Debugging messages"
 > +	depends on INFINIBAND_RDS
 > +        default n

No way to enable this?  Disabled by default?

You really want debugging messages to be built by default and controlled
at runtime ... otherwise debugging end-user installations is a pain
(they just install what the distro gives them, and it's very hard for
them to rebuild just to enable debugging).

 > +ib_rds-y :=	af_rds.o bind.o cong.o connection.o info.o message.o   \
 > +			recv.o send.o stats.o sysctl.o threads.o transport.o \
 > +			loop.o page.o rdma.o
 > +
 > +ib_rds-y += ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
 > +			ib_sysctl.o ib_rdma.o

a very strange way to write an assignment statement...

 - R.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 17/21] RDS/IB: Receive datagrams via IB
  2009-01-27  2:17 ` [PATCH 17/21] RDS/IB: Receive datagrams via IB Andy Grover
@ 2009-01-29  0:05   ` Roland Dreier
  2009-01-29  2:20     ` Andy Grover
  0 siblings, 1 reply; 64+ messages in thread
From: Roland Dreier @ 2009-01-29  0:05 UTC (permalink / raw)
  To: Andy Grover; +Cc: netdev, rds-devel, general

 > +static int rds_ib_recv_refill_one(struct rds_connection *conn,
 > +				  struct rds_ib_recv_work *recv,
 > +				  gfp_t kptr_gfp, gfp_t page_gfp)
 > +{
 > +	struct rds_ib_connection *ic = conn->c_transport_data;
 > +	dma_addr_t dma_addr;
 > +	struct ib_sge *sge;
 > +	int ret = -ENOMEM;
 > +
 > +	if (recv->r_ibinc == NULL) {
 > +		if (atomic_read(&rds_ib_allocation) >= rds_ib_sysctl_max_recv_allocation) {
 > +			rds_ib_stats_inc(s_ib_rx_alloc_limit);
 > +			goto out;
 > +		}
 > +		recv->r_ibinc = kmem_cache_alloc(rds_ib_incoming_slab,
 > +						 kptr_gfp);
 > +		if (recv->r_ibinc == NULL)
 > +			goto out;
 > +		atomic_inc(&rds_ib_allocation);

This is racy.  You check if you're at the limit, do the allocation, and
then increment the atomic rds_ib_allocation count.  So many threads can
pass the atomic_read() test and then take you over the limit.  If you
want to make it safe then you could do atomic_inc_return() and check if
that took you over the limit.
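
Something like this sketch, reusing the names from the quoted code
(illustrative, not a tested fix):

	if (atomic_inc_return(&rds_ib_allocation) >
	    rds_ib_sysctl_max_recv_allocation) {
		atomic_dec(&rds_ib_allocation);	/* undo our reservation */
		rds_ib_stats_inc(s_ib_rx_alloc_limit);
		goto out;
	}
	recv->r_ibinc = kmem_cache_alloc(rds_ib_incoming_slab, kptr_gfp);
	if (recv->r_ibinc == NULL) {
		atomic_dec(&rds_ib_allocation);
		goto out;
	}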

 - R.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 0/21] Reliable Datagram Sockets (RDS)
  2009-01-28 22:37 ` Roland Dreier
@ 2009-01-29  1:29   ` Andy Grover
  0 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-29  1:29 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev, rds-devel, general

Roland Dreier wrote:
>  > This patchset adds support for RDS as an Infiniband ULP. RDS is an
>  > Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably,
>  > and is used currently in Oracle RAC and Exadata products. It's lived
>  > in OFED for 2+ years and I think it's time to get it upstream -- most
>  > likely into your -next tree for .30, but if it snuck into .29 via the
>  > "new code merge-window exception" then even better.
> 
> I'll read this over and comment, but to be honest I agree with Dave:
> this is a new socket family, and as such it belongs under net/ and
> probably should go through Dave's tree (just as the NFS/RDMA changes
> went through the NFS trees and the 9p/RDMA changes went through the 9p
> tree); even though it heavily uses RDMA I think the upper layer
> interface to sockets/networking is more relevant.

OK no prob. I'll probably be ready early next week with an updated patchset.

Thanks -- Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 20/21] RDS: Kconfig and Makefile
  2009-01-28 22:59   ` Roland Dreier
@ 2009-01-29  2:19     ` Andy Grover
  2009-01-29  5:14       ` Roland Dreier
  0 siblings, 1 reply; 64+ messages in thread
From: Andy Grover @ 2009-01-29  2:19 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev, rds-devel, general

Roland Dreier wrote:
>  > +obj-$(CONFIG_INFINIBAND_ISER)		+= ulp/rds/
> 
> Typo for ..._RDS

Whups. :)

>  > +config INFINIBAND_RDS_DEBUG
>  > +        bool "Debugging messages"
>  > +	depends on INFINIBAND_RDS
>  > +        default n
> 
> No way to enable this?  Disabled by default?
> 
> You really want debugging messages to be built by default and controlled
> at runtime ... otherwise debugging end-user installations is a pain
> (they just install what the distro gives them, and it's very hard for
> them to rebuild just to enable debugging).

So the solution is just to base debug message output on a variable,
instead of a config option? RDS actually does do this a little already,
so converting totally isn't hard. I hadn't seen it mentioned that this
was preferable -- indeed, tons of drivers and subsystems have options
for compile-time debug statements; should these be converted?
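
A minimal sketch of the runtime variant, assuming a module parameter
named rds_debug:

	static unsigned int rds_debug;
	module_param(rds_debug, uint, 0644);
	MODULE_PARM_DESC(rds_debug, "enable RDS debug messages");

	#define rdsdebug(fmt, args...) do {				  \
		if (rds_debug)						  \
			printk(KERN_DEBUG "%s: " fmt, __func__, ##args); \
	} while (0)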

>  > +ib_rds-y :=	af_rds.o bind.o cong.o connection.o info.o message.o   \
>  > +			recv.o send.o stats.o sysctl.o threads.o transport.o \
>  > +			loop.o page.o rdma.o
>  > +
>  > +ib_rds-y += ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
>  > +			ib_sysctl.o ib_rdma.o
> 
> a very strange way to write an assignment statement...

RDS is implemented as a core sockets layer and then a transport layer.
IB is currently the only transport so I thought it made sense to just
compile them together, but once there is more than one, RDS's IB support
could be broken out into its own module.

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 17/21] RDS/IB: Receive datagrams via IB
  2009-01-29  0:05   ` [ofa-general] " Roland Dreier
@ 2009-01-29  2:20     ` Andy Grover
  2009-01-29 21:02       ` Olaf Kirch
  0 siblings, 1 reply; 64+ messages in thread
From: Andy Grover @ 2009-01-29  2:20 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev, rds-devel, general

Roland Dreier wrote:
>  > +static int rds_ib_recv_refill_one(struct rds_connection *conn,
>  > +				  struct rds_ib_recv_work *recv,
>  > +				  gfp_t kptr_gfp, gfp_t page_gfp)
>  > +{
>  > +	struct rds_ib_connection *ic = conn->c_transport_data;
>  > +	dma_addr_t dma_addr;
>  > +	struct ib_sge *sge;
>  > +	int ret = -ENOMEM;
>  > +
>  > +	if (recv->r_ibinc == NULL) {
>  > +		if (atomic_read(&rds_ib_allocation) >= rds_ib_sysctl_max_recv_allocation) {
>  > +			rds_ib_stats_inc(s_ib_rx_alloc_limit);
>  > +			goto out;
>  > +		}
>  > +		recv->r_ibinc = kmem_cache_alloc(rds_ib_incoming_slab,
>  > +						 kptr_gfp);
>  > +		if (recv->r_ibinc == NULL)
>  > +			goto out;
>  > +		atomic_inc(&rds_ib_allocation);
> 
> This is racy.  You check if you're at the limit, do the allocation, and
> then increment the atomic rds_ib_allocation count.  So many threads can
> pass the atomic_read() test and then take you over the limit.  If you
> want to make it safe then you could do atomic_inc_return() and check if
> that took you over the limit.

Woah, yup, thanks.

-- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 03/21] RDS: Congestion-handling code
  2009-01-28 22:57   ` Roland Dreier
@ 2009-01-29  2:39     ` Andy Grover
  0 siblings, 0 replies; 64+ messages in thread
From: Andy Grover @ 2009-01-29  2:39 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev, rds-devel, general

Roland Dreier wrote:
>  > +EXPORT_SYMBOL_GPL(rds_cong_map_updated);
> 
> What is this being exported to?  AFAICT you are only building a single
> RDS module, right?

In the current RDS development repo, transports are modularizable. For
the initial upstream submission I just wanted to include the IB
transport so I changed the build to compile rds-core and rds-ib
together, but didn't pull out the exports.

Steve Wise is working on having the iwarp transport debugged in the near
future. Once that is added and we make transports modularizable then
those exports are needed.

Take them out for now, then?

Thanks -- Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* ***SPAM*** Re: [ofa-general] Re: [PATCH 06/21] RDS: Connection handling
  2009-01-27 16:28       ` [ofa-general] " Steve Wise
@ 2009-01-29  3:03         ` Andrew Grover
  2009-01-29  8:03           ` Evgeniy Polyakov
  0 siblings, 1 reply; 64+ messages in thread
From: Andrew Grover @ 2009-01-29  3:03 UTC (permalink / raw)
  To: Steve Wise
  Cc: rds-devel, netdev, rdreier, Oliver Neukum, general, Evgeniy Polyakov

On Tue, Jan 27, 2009 at 8:28 AM, Steve Wise <swise@opengridcomputing.com> wrote:
> Oliver Neukum wrote:
>>
>> Am Tuesday 27 January 2009 14:34:19 schrieb Evgeniy Polyakov:
>>
>>>
>>> On Mon, Jan 26, 2009 at 06:17:43PM -0800, Andy Grover
>>> (andy.grover@oracle.com) wrote:
>>>
>>>>
>>>> +static inline int rds_conn_is_sending(struct rds_connection *conn)
>>>> +{
>>>> +       int ret = 0;
>>>> +
>>>> +       if (!mutex_trylock(&conn->c_send_lock))
>>>> +               ret = 1;
>>>> +       else
>>>> +               mutex_unlock(&conn->c_send_lock);
>>>> +
>>>> +       return ret;
>>>> +}
>>>> +
>>>>
>>>
>>> This one is eventually invoked under a spin_lock with IRQs turned off,
>>> which may freeze the machine:
>>> rds_for_each_conn_info() -> spin_lock_irqsave(global lock) ->
>>> rds_conn_info_visitor() -> rds_conn_info_set() -> rds_conn_is_sending()
>>> -> boom.
>>>
>>
>> Why? This is _trylock. It won't block.
>>
>>
>
> mutex_trylock() uses spin_lock_mutex() which has this in the debug version:
>
> DEBUG_LOCKS_WARN_ON(in_interrupt());

What's the best way to fix this?

This is all so rds-info can print out a nice list of connections and
whether they're sending or not. I don't see an easy way to fix this. A
_trylock-like function that only tested the lock without taking it
would be nice. I can always just not report this particular bit of
info; that might actually be easiest.

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 01/21] RDS: Socket interface
  2009-01-27  3:46   ` Stephen Hemminger
@ 2009-01-29  3:17     ` Andrew Grover
  0 siblings, 0 replies; 64+ messages in thread
From: Andrew Grover @ 2009-01-29  3:17 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, rdreier, rds-devel, general

On Mon, Jan 26, 2009 at 7:46 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
>> +/* this is just used for stats gathering :/ */
>
> Then I would think a high-speed protocol would use per-cpu
> and/or RCU.

For a spinlock guarding a socket list? I wouldn't think it would be
worth the complexity.

[other comments snipped, will fix per your advice, thanks!]

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 01/21] RDS: Socket interface
  2009-01-27 12:08   ` Evgeniy Polyakov
@ 2009-01-29  4:02     ` Andrew Grover
  2009-01-29 16:24       ` Evgeniy Polyakov
  0 siblings, 1 reply; 64+ messages in thread
From: Andrew Grover @ 2009-01-29  4:02 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: netdev, rdreier, rds-devel, general

On Tue, Jan 27, 2009 at 4:08 AM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> Hi Andy.

Hi Evgeniy, thanks for your time in reviewing.

>> +/* this is just used for stats gathering :/ */
>
> Shouldn't this be some kind of per-cpu data?
> A global list of all sockets? This does not scale; maybe it should be
> grouped into a hash table or be per-device?

sch mentioned this too... is socket creation often a bottleneck? If
so, we can certainly improve scalability here.

In any case, this is in the code to support a listing of RDS sockets
via the rds-info utility. Instead of having our own custom program to
list rds sockets we probably want to export an interface so netstat
will list them. Unfortunately netstat seems to be hardcoded to look
for particular entries in /proc/net, so both rds and netstat would
need to be updated before this would work, and RDS's custom
socket-listing interface dropped.

>> +static int rds_release(struct socket *sock)
>> +{
>> +     struct sock *sk = sock->sk;
>> +     struct rds_sock *rs;
>> +     unsigned long flags;
>> +
>> +     if (sk == NULL)
>> +             goto out;
>> +
>> +     rs = rds_sk_to_rs(sk);
>> +
>> +     sock_orphan(sk);
>
> Why is it needed getting socket is about to be freed?

from the comments above that code:

 * We have to be careful about racing with the incoming path.  sock_orphan()
 * sets SOCK_DEAD and we use that as an indicator to the rx path that new
 * messages shouldn't be queued.

Is that an appropriate usage of sock_orphan()?
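
(For context, a hedged sketch of the check that comment implies on the
rx side; the lines below are illustrative, not quoted from the patch:)

	/* rx path: rds_release() called sock_orphan(), which set
	 * SOCK_DEAD, so don't queue new messages to a socket that is
	 * being torn down. */
	if (sock_flag(sk, SOCK_DEAD))
		goto drop;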

> Do RDS sockets work well under workloads with a high rate of
> creation/destruction?

I'd guess from your comments that performance probably wouldn't be great :)

>> +static unsigned int rds_poll(struct file *file, struct socket *sock,
>> +                          poll_table *wait)
>> +{
>> +     struct sock *sk = sock->sk;
>> +     struct rds_sock *rs = rds_sk_to_rs(sk);
>> +     unsigned int mask = 0;
>> +     unsigned long flags;
>> +
>> +     poll_wait(file, sk->sk_sleep, wait);
>> +
>> +     poll_wait(file, &rds_poll_waitq, wait);
>> +
>
> Are you absolutely sure that the provided poll_table callback
> will not do bad things here? It is quite unusual to add several
> different wait queues through the same poll_table in the poll callback.
> And shouldn't rds_poll_waitq be lock-protected here?

I don't know. I looked into the poll_wait code a little and it
appeared to be designed to allow multiple queues.

I'm not very strong in this area and would love some more expert input here.

>> +     read_lock_irqsave(&rs->rs_recv_lock, flags);
>> +     if (!rs->rs_cong_monitor) {
>> +             /* When a congestion map was updated, we signal POLLIN for
>> +              * "historical" reasons. Applications can also poll for
>> +              * WRBAND instead. */
>> +             if (rds_cong_updated_since(&rs->rs_cong_track))
>> +                     mask |= (POLLIN | POLLRDNORM | POLLWRBAND);
>> +     } else {
>> +             spin_lock(&rs->rs_lock);
>
> Is there a possibility of a lock-interaction problem with the
> rs_recv_lock read lock above?

I didn't see anywhere that they were acquired in reverse order,
or simultaneously. This is the kind of thing that lockdep would find
immediately, right? I think I've got that turned on, but I'll double
check.

>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)
>
> This should be dropped in the mainline tree.

yup.

> A hash table of the appropriate size will have faster lookup/access
> times, btw.

No doubt. Definitely want to make this improvement at some point.

>> +static struct rds_sock *rds_bind_tree_walk(__be32 addr, __be16 port,
>> +                                        struct rds_sock *insert)
>> +{
>> +     struct rb_node **p = &rds_bind_tree.rb_node;
>> +     struct rb_node *parent = NULL;
>> +     struct rds_sock *rs;
>> +     u64 cmp;
>> +     u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
>> +
>> +     while (*p) {
>> +             parent = *p;
>> +             rs = rb_entry(parent, struct rds_sock, rs_bound_node);
>> +
>> +             cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
>> +                   be16_to_cpu(rs->rs_bound_port);
>> +
>> +             if (needle < cmp)
>
> Should it use wrapping logic if some field overflows?

Sorry, please explain?

>> +     rdsdebug("returning rs %p for %u.%u.%u.%u:%u\n", rs, NIPQUAD(addr),
>> +             ntohs(port));
>
> Iirc there is a new %pi4 or similar format id.

Yup, will do.
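
(The %pI4 extension takes a pointer to the big-endian address, so the
quoted line would become something like:)

	rdsdebug("returning rs %p for %pI4:%u\n", rs, &addr, ntohs(port));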

Thanks again.

Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 20/21] RDS: Kconfig and Makefile
  2009-01-29  2:19     ` [ofa-general] " Andy Grover
@ 2009-01-29  5:14       ` Roland Dreier
  0 siblings, 0 replies; 64+ messages in thread
From: Roland Dreier @ 2009-01-29  5:14 UTC (permalink / raw)
  To: Andy Grover; +Cc: netdev, rds-devel, general

 > So the solution is just to base debug message output on a variable
 > instead of a config option? RDS actually does this a little already,
 > so converting entirely isn't hard. I hadn't seen it mentioned that
 > this was preferable -- indeed, tons of drivers and subsystems have
 > options for compile-time debug statements; should those be converted?

My experience is definitely that compile-time switches are a big pain
when you actually have to debug something that can only be reproduced on
someone else's setup (which will happen once users start using your
stuff).  You probably can use the dynamic_printk stuff that went in
recently to make this all very clean and standard.
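
(A sketch of that conversion, assuming the dynamic printk
infrastructure of that era; rdsdebug() is the macro name the patches
already use:)

	/* pr_debug() callsites compile out unless debugging is
	 * enabled, and with dynamic printk they can be switched on
	 * per-callsite at run time instead of rebuilding. */
	#define rdsdebug(fmt, args...) \
		pr_debug("%s(): " fmt, __func__, ##args)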

 - R.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [ofa-general] Re: [PATCH 06/21] RDS: Connection handling
  2009-01-29  3:03         ` ***SPAM*** " Andrew Grover
@ 2009-01-29  8:03           ` Evgeniy Polyakov
  0 siblings, 0 replies; 64+ messages in thread
From: Evgeniy Polyakov @ 2009-01-29  8:03 UTC (permalink / raw)
  To: Andrew Grover
  Cc: Steve Wise, Oliver Neukum, netdev, rdreier, rds-devel, general

On Wed, Jan 28, 2009 at 07:03:49PM -0800, Andrew Grover (andy.grover@gmail.com) wrote:
> What's the best way to fix this?
> 
> This is all so rds-info can print out a nice list of connections and
> whether they're sending or not. I don't see an easy way to fix this. A
> _trylock-like function that only tested the lock without taking it
> would be nice. I can always just not report this particular bit of
> info; that might actually be easiest.

You use atomic variables for the other cases; add another one here to
mark a locked connection. It looks ugly, but at least it does not crash.
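
(A rough sketch of that approach, with a hypothetical c_senders
counter added to struct rds_connection; not what was posted:)

	/* send path: advertise an active sender around the mutex */
	atomic_inc(&conn->c_senders);
	mutex_lock(&conn->c_send_lock);
	/* ... send ... */
	mutex_unlock(&conn->c_send_lock);
	atomic_dec(&conn->c_senders);

	/* info path: safe to call with spinlocks held and irqs off */
	static inline int rds_conn_is_sending(struct rds_connection *conn)
	{
		return atomic_read(&conn->c_senders) != 0;
	}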

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 01/21] RDS: Socket interface
  2009-01-29  4:02     ` [ofa-general] " Andrew Grover
@ 2009-01-29 16:24       ` Evgeniy Polyakov
  0 siblings, 0 replies; 64+ messages in thread
From: Evgeniy Polyakov @ 2009-01-29 16:24 UTC (permalink / raw)
  To: Andrew Grover; +Cc: netdev, rdreier, rds-devel, general

Hi Andy.

On Wed, Jan 28, 2009 at 08:02:49PM -0800, Andrew Grover (andy.grover@gmail.com) wrote:
> Hi Evgeniy, thanks for your time in reviewing.

No problem :)

> >> +/* this is just used for stats gathering :/ */
> >
> > Shouldn't this be some kind of per-cpu data?
> > A global list of all sockets? This does not scale; maybe it should be
> > grouped into a hash table or be per-device?
> 
> sch mentioned this too... is socket creation often a bottleneck? If
> so, we can certainly improve scalability here.

It depends on the workload, but it becomes a noticeable portion of
the overhead for multi-client, short-lived connections. These sockets
likely will not be used for a web-server-like load, but anything
similar will clearly show a bottleneck with this list.

> In any case, this is in the code to support a listing of RDS sockets
> via the rds-info utility. Instead of having our own custom program to
> list rds sockets we probably want to export an interface so netstat
> will list them. Unfortunately netstat seems to be hardcoded to look
> for particular entries in /proc/net, so both rds and netstat would
> need to be updated before this would work, and RDS's custom
> socket-listing interface dropped.

Other sockets use a similar technique, but they are grouped into a hash
table, so if you think the number of sockets will be noticeably large,
or that they will be frequently created and removed, it may be worth
pushing them into a hash table.
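
(A minimal sketch of that grouping, keyed on the bound address/port
pair; the names below are hypothetical, not from the patches:)

	#include <linux/hash.h>
	#include <linux/jhash.h>

	#define RDS_BIND_HASH_BITS	10
	static struct hlist_head rds_bind_hash[1 << RDS_BIND_HASH_BITS];

	static struct hlist_head *rds_bind_bucket(__be32 addr, __be16 port)
	{
		u32 h = jhash_2words((__force u32)addr,
				     (__force u32)port, 0);

		return &rds_bind_hash[hash_32(h, RDS_BIND_HASH_BITS)];
	}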

> Is that an appropriate usage of sock_orphan()?

It is; I missed the process-context socket detach, so things are correct.

> > Do RDS sockets work well under workloads with a high rate of
> > creation/destruction?
> 
> I'd guess from your comments that performance probably wouldn't be great :)

Something tells me the same :)

> >> +static unsigned int rds_poll(struct file *file, struct socket *sock,
> >> +                          poll_table *wait)
> >> +{
> >> +     struct sock *sk = sock->sk;
> >> +     struct rds_sock *rs = rds_sk_to_rs(sk);
> >> +     unsigned int mask = 0;
> >> +     unsigned long flags;
> >> +
> >> +     poll_wait(file, sk->sk_sleep, wait);
> >> +
> >> +     poll_wait(file, &rds_poll_waitq, wait);
> >> +
> >
> > Are you absolutely sure that the provided poll_table callback
> > will not do bad things here? It is quite unusual to add several
> > different wait queues through the same poll_table in the poll callback.
> > And shouldn't rds_poll_waitq be lock-protected here?
> 
> I don't know. I looked into the poll_wait code a little and it
> appeared to be designed to allow multiple queues.
> 
> I'm not very strong in this area and would love some more expert input here.

It depends on how the poll_table was initialized and how its callback
(invoked from poll_wait()) operates on the given queue and head.
If you introduce your own polling, some care has to be taken with the
ordering of the wait queues and with what their callbacks return when a
polling event is found.

For example, with your own initialization it is possible that when
multiple queues are registered in the same table, only one of them
will be awakened (i.e., have its callback invoked).

If you just hook into the existing machinery things should be OK,
though, so this is just something to pay attention to; it does not
indicate a bug.
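
(For reference, poll_wait() itself is just a guarded dispatch to the
table's callback -- roughly the following in kernels of that era --
which is why registering several queues from one ->poll() works; the
default callback, __pollwait(), allocates one entry per queue:)

	static inline void poll_wait(struct file *filp,
				     wait_queue_head_t *wait_address,
				     poll_table *p)
	{
		if (p && wait_address)
			p->qproc(filp, wait_address, p);
	}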

> >> +     read_lock_irqsave(&rs->rs_recv_lock, flags);
> >> +     if (!rs->rs_cong_monitor) {
> >> +             /* When a congestion map was updated, we signal POLLIN for
> >> +              * "historical" reasons. Applications can also poll for
> >> +              * WRBAND instead. */
> >> +             if (rds_cong_updated_since(&rs->rs_cong_track))
> >> +                     mask |= (POLLIN | POLLRDNORM | POLLWRBAND);
> >> +     } else {
> >> +             spin_lock(&rs->rs_lock);
> >
> > Is there a possibility of a lock-interaction problem with the
> > rs_recv_lock read lock above?
> 
> I didn't see anywhere that they were acquired in reverse order,
> or simultaneously. This is the kind of thing that lockdep would find
> immediately, right? I think I've got that turned on, but I'll double
> check.

If lockdep enters the bad race, then yes, it will flag it.
I was just worried that we take a spinlock under the read lock, so some
bad interaction with the write lock may happen.

> >> +             cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
> >> +                   be16_to_cpu(rs->rs_bound_port);
> >> +
> >> +             if (needle < cmp)
> >
> > Should it use wrapping logic if some field overflows?
> 
> Sorry, please explain?

I meant that if there is an unsigned overflow, the value will suddenly
become a small number, so the wrap-safe comparison logic used for
network timestamps could be applied; but apparently neither the address
nor the port changes during the socket's lifetime, so nothing special
is needed.
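
(The wrap-safe comparison he alludes to is the usual signed-difference
trick, e.g.:)

	/* true if a is "before" b, even across u32 wraparound */
	static inline int seq_before(u32 a, u32 b)
	{
		return (s32)(a - b) < 0;
	}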

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] ***SPAM*** Re: [PATCH 01/21] RDS: Socket interface
  2009-01-27  4:11   ` David Miller
@ 2009-01-29 20:22     ` Andrew Grover
  0 siblings, 0 replies; 64+ messages in thread
From: Andrew Grover @ 2009-01-29 20:22 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, rdreier, rds-devel, general

On Mon, Jan 26, 2009 at 8:11 PM, David Miller <davem@davemloft.net> wrote:
> Socket family implementations do not belong under the
> infiniband subdirectory.
>
> Put it under net/ instead.
>
> I don't care what the interdependencies happen to be.

Roland, so you're ok with infiniband code under net/?

(or should I split it up?)

Thanks -- Regards -- Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 17/21] RDS/IB: Receive datagrams via IB
  2009-01-29  2:20     ` Andy Grover
@ 2009-01-29 21:02       ` Olaf Kirch
  2009-01-29 21:47         ` [ofa-general] " Roland Dreier
  0 siblings, 1 reply; 64+ messages in thread
From: Olaf Kirch @ 2009-01-29 21:02 UTC (permalink / raw)
  To: Andy Grover; +Cc: Roland Dreier, rds-devel, general, netdev

On Thursday 29 January 2009 03:20:40 Andy Grover wrote:
> > This is racy.  You check if you're at the limit, do the allocation, and
> > then increment the atomic rds_ib_allocation count.  So many threads can
> > pass the atomic_read() test and then take you over the limit.  If you
> > want to make it safe then you could do atomic_inc_return() and check if
> > that took you over the limit.
>
> Woah, yup, thanks.

The refill code used to be single-threaded, and I think it still is, so
I don't think this can race.

Olaf

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [ofa-general] Re: [PATCH 17/21] RDS/IB: Receive datagrams via IB
  2009-01-29 21:02       ` Olaf Kirch
@ 2009-01-29 21:47         ` Roland Dreier
  0 siblings, 0 replies; 64+ messages in thread
From: Roland Dreier @ 2009-01-29 21:47 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: netdev, rds-devel, general

 > > > This is racy.  You check if you're at the limit, do the allocation, and
 > > > then increment the atomic rds_ib_allocation count.  So many threads can
 > > > pass the atomic_read() test and then take you over the limit.  If you
 > > > want to make it safe then you could do atomic_inc_return() and check if
 > > > that took you over the limit.
 > >
 > > Woah, yup, thanks.
 > 
 > The refill code used to be single-threaded, and I think it still is, so
 > I don't think this can race.

So you don't need the atomic op at all?

 - R.

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2009-01-29 21:47 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-27  2:17 [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Andy Grover
2009-01-27  2:17 ` [ofa-general] [PATCH 01/21] RDS: Socket interface Andy Grover
2009-01-27  3:46   ` Stephen Hemminger
2009-01-29  3:17     ` [ofa-general] " Andrew Grover
2009-01-27  4:11   ` David Miller
2009-01-29 20:22     ` [ofa-general] ***SPAM*** " Andrew Grover
2009-01-27 12:08   ` Evgeniy Polyakov
2009-01-29  4:02     ` [ofa-general] " Andrew Grover
2009-01-29 16:24       ` Evgeniy Polyakov
2009-01-27  2:17 ` [ofa-general] [PATCH 02/21] RDS: Main header file Andy Grover
2009-01-27  7:34   ` Rémi Denis-Courmont
2009-01-27 19:27     ` [ofa-general] " Andrew Grover
2009-01-27 13:05   ` Evgeniy Polyakov
2009-01-27 19:23     ` [ofa-general] ***SPAM*** " Andrew Grover
2009-01-27 19:24       ` Steve Wise
2009-01-27  2:17 ` [PATCH 03/21] RDS: Congestion-handling code Andy Grover
2009-01-27  3:48   ` Stephen Hemminger
2009-01-27 19:15     ` Andrew Grover
2009-01-27 13:10   ` Evgeniy Polyakov
2009-01-27 19:10     ` Andrew Grover
2009-01-28 22:57   ` Roland Dreier
2009-01-29  2:39     ` [ofa-general] " Andy Grover
2009-01-27  2:17 ` [PATCH 04/21] RDS: Transport code Andy Grover
2009-01-27 13:18   ` Evgeniy Polyakov
2009-01-27 19:36     ` Andrew Grover
2009-01-27 21:56       ` [ofa-general] " Evgeniy Polyakov
2009-01-27 22:15         ` [ofa-general] ***SPAM*** " Andrew Grover
2009-01-27  2:17 ` [ofa-general] [PATCH 05/21] RDS: Info and stats Andy Grover
2009-01-27 13:28   ` Evgeniy Polyakov
2009-01-27  2:17 ` [PATCH 06/21] RDS: Connection handling Andy Grover
2009-01-27 13:34   ` Evgeniy Polyakov
2009-01-27 13:47     ` Oliver Neukum
2009-01-27 13:51       ` Evgeniy Polyakov
2009-01-27 16:28       ` [ofa-general] " Steve Wise
2009-01-29  3:03         ` ***SPAM*** " Andrew Grover
2009-01-29  8:03           ` Evgeniy Polyakov
2009-01-27  2:17 ` [PATCH 07/21] RDS: loopback Andy Grover
2009-01-27  2:17 ` [PATCH 08/21] RDS: sysctls Andy Grover
2009-01-27  2:17 ` [PATCH 09/21] RDS: Message parsing Andy Grover
2009-01-27  2:17 ` [PATCH 10/21] RDS: send.c Andy Grover
2009-01-27  2:17 ` [PATCH 11/21] RDS: recv.c Andy Grover
2009-01-27  2:17 ` [PATCH 12/21] RDS: RDMA support Andy Grover
2009-01-27  2:17 ` [ofa-general] [PATCH 13/21] RDS/IB: Infiniband transport Andy Grover
2009-01-27  2:17 ` [PATCH 14/21] RDS/IB: Ring-handling code Andy Grover
2009-01-27  2:17 ` [PATCH 15/21] RDS/IB: Implement RDMA ops using FMRs Andy Grover
2009-01-27  2:17 ` [PATCH 16/21] RDS/IB: Implement IB-specific datagram send Andy Grover
2009-01-27  2:17 ` [PATCH 17/21] RDS/IB: Receive datagrams via IB Andy Grover
2009-01-29  0:05   ` [ofa-general] " Roland Dreier
2009-01-29  2:20     ` Andy Grover
2009-01-29 21:02       ` Olaf Kirch
2009-01-29 21:47         ` [ofa-general] " Roland Dreier
2009-01-27  2:17 ` [PATCH 18/21] RDS/IB: Stats and sysctls Andy Grover
2009-01-27  2:17 ` [PATCH 19/21] RDS: Documentation Andy Grover
2009-01-27  2:17 ` [PATCH 20/21] RDS: Kconfig and Makefile Andy Grover
2009-01-28 22:59   ` Roland Dreier
2009-01-29  2:19     ` [ofa-general] " Andy Grover
2009-01-29  5:14       ` Roland Dreier
2009-01-27  2:17 ` [PATCH 21/21] RDS: Add AF and PF #defines for RDS sockets Andy Grover
2009-01-27  7:27   ` Rémi Denis-Courmont
2009-01-27 19:31     ` [ofa-general] " Andrew Grover
2009-01-27 15:34 ` [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Steve Wise
2009-01-27 19:29   ` ***SPAM*** " Andrew Grover
2009-01-28 22:37 ` Roland Dreier
2009-01-29  1:29   ` [ofa-general] " Andy Grover
