linux-kernel.vger.kernel.org archive mirror
* [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
@ 2014-02-21 21:53 Jiri Kosina
       [not found] ` <CAJZOPZK4Ah+nKPWnX3=yM43jbf586GYJ+fh0-OL4bOnqKK8v8A@mail.gmail.com>
  2014-03-06 13:31 ` Or Gerlitz
  0 siblings, 2 replies; 22+ messages in thread
From: Jiri Kosina @ 2014-02-21 21:53 UTC
  To: Roland Dreier, Amir Vadai, Eli Cohen, Or Gerlitz,
	Eugenia Emantayev, David S. Miller, Mel Gorman
  Cc: netdev, linux-kernel

This was originally a patch from Matthew Finlay <matt@mellanox.com> that 
addressed a problem whereby NFS writes would enter uninterruptible sleep 
forever.  The issue happened when using NFS over IPoIB. This is not a 
recommended configuration as RDMA is preferred but it is still a valid 
configuration and is important to have in situations where the NFS server 
does not support RDMA. The problem encountered was described as follows:

	It's not memory reclamation that is the problem as such. There is
	an indirect dependency between network filesystems writing back
	pages and ipoib_cm_tx_init() due to how a kworker is used. Page
	reclaim cannot make forward progress until ipoib_cm_tx_init()
	succeeds and it is stuck in page reclaim itself waiting for network
	transmission. Ordinarily this situation may be avoided by having
	the caller use GFP_NOFS but ipoib_cm_tx_init() does not have
	that information.

The patch has been ported to newer kernels by Mel Gorman and later ported 
further by Jiri Kosina. 

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
---


I'd like to get confirmation from Matt that he's fine with having his 
Signed-off-by: on this, but he's unfortunately not responding to any of my 
queries.

Any ideas for a cleaner fix are more than welcome. We've been carrying this 
patch in the SUSE kernel tree to fix a real reported issue for quite some 
time already.
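
One possible direction for a cleaner fix, given as a sketch only: the
repeated 'use_gfp_nofs ? GFP_NOFS : GFP_KERNEL' ternaries could be folded
into a single helper. The helper name sketch_gfp_mask() and the
SKETCH_GFP_* values below are hypothetical userspace placeholders, not
kernel definitions, so this shows the shape of the idea rather than code
that could be applied to the tree.

```c
/* Userspace sketch: gfp_t and the two masks are placeholders, not the
 * kernel's real GFP_KERNEL / GFP_NOFS definitions. */
typedef unsigned int gfp_t;

#define SKETCH_GFP_KERNEL 0x1u	/* hypothetical stand-in for GFP_KERNEL */
#define SKETCH_GFP_NOFS   0x2u	/* hypothetical stand-in for GFP_NOFS */

/* Hypothetical helper: map the patch's use_gfp_nofs int (any nonzero
 * value means "avoid filesystem recursion") to an allocation mask. */
static inline gfp_t sketch_gfp_mask(int use_gfp_nofs)
{
	return use_gfp_nofs ? SKETCH_GFP_NOFS : SKETCH_GFP_KERNEL;
}
```

With such a helper, each allocation site would pass
sketch_gfp_mask(use_gfp_nofs) instead of repeating the conditional.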


 drivers/infiniband/hw/mlx4/cq.c                    |    6 ++--
 drivers/infiniband/hw/mlx4/qp.c                    |   36 +++++++++++++++-----
 drivers/infiniband/hw/mlx4/srq.c                   |    7 ++--
 drivers/infiniband/ulp/ipoib/ipoib_cm.c            |   16 ++++++++-
 drivers/net/ethernet/mellanox/mlx4/alloc.c         |   35 ++++++++++++-------
 drivers/net/ethernet/mellanox/mlx4/cq.c            |    4 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |    6 ++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         |    2 +-
 drivers/net/ethernet/mellanox/mlx4/icm.c           |   10 ++++--
 drivers/net/ethernet/mellanox/mlx4/icm.h           |    3 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h          |    4 +-
 drivers/net/ethernet/mellanox/mlx4/mr.c            |   17 +++++----
 drivers/net/ethernet/mellanox/mlx4/qp.c            |   21 ++++++-----
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |    4 +-
 drivers/net/ethernet/mellanox/mlx4/srq.c           |    4 +-
 include/linux/mlx4/device.h                        |   10 +++--
 16 files changed, 117 insertions(+), 68 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index cc40f08..661185a 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -102,7 +102,7 @@ static int mlx4_ib_alloc_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *
 	int err;
 
 	err = mlx4_buf_alloc(dev->dev, nent * dev->dev->caps.cqe_size,
-			     PAGE_SIZE * 2, &buf->buf);
+			     PAGE_SIZE * 2, &buf->buf, 0);
 
 	if (err)
 		goto out;
@@ -113,7 +113,7 @@ static int mlx4_ib_alloc_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *
 	if (err)
 		goto err_buf;
 
-	err = mlx4_buf_write_mtt(dev->dev, &buf->mtt, &buf->buf);
+	err = mlx4_buf_write_mtt(dev->dev, &buf->mtt, &buf->buf, 0);
 	if (err)
 		goto err_mtt;
 
@@ -209,7 +209,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
 
 		uar = &to_mucontext(context)->uar;
 	} else {
-		err = mlx4_db_alloc(dev->dev, &cq->db, 1);
+		err = mlx4_db_alloc(dev->dev, &cq->db, 1, 0);
 		if (err)
 			goto err_cq;
 
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index d8f4d1f..1379ee7 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -744,14 +744,18 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			goto err;
 
 		if (qp_has_rq(init_attr)) {
-			err = mlx4_db_alloc(dev->dev, &qp->db, 0);
+			err = mlx4_db_alloc(dev->dev, &qp->db, 0,
+					init_attr->create_flags &
+						IB_QP_CREATE_USE_GFP_NOFS);
 			if (err)
 				goto err;
 
 			*qp->db.db = 0;
 		}
 
-		if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf)) {
+		if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf,
+					init_attr->create_flags &
+						IB_QP_CREATE_USE_GFP_NOFS)) {
 			err = -ENOMEM;
 			goto err_db;
 		}
@@ -761,12 +765,20 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 		if (err)
 			goto err_buf;
 
-		err = mlx4_buf_write_mtt(dev->dev, &qp->mtt, &qp->buf);
+		err = mlx4_buf_write_mtt(dev->dev, &qp->mtt, &qp->buf,
+				init_attr->create_flags &
+					IB_QP_CREATE_USE_GFP_NOFS);
 		if (err)
 			goto err_mtt;
 
-		qp->sq.wrid  = kmalloc(qp->sq.wqe_cnt * sizeof (u64), GFP_KERNEL);
-		qp->rq.wrid  = kmalloc(qp->rq.wqe_cnt * sizeof (u64), GFP_KERNEL);
+		qp->sq.wrid  = kmalloc(qp->sq.wqe_cnt * sizeof (u64),
+					init_attr->create_flags &
+						IB_QP_CREATE_USE_GFP_NOFS ?
+							GFP_NOFS : GFP_KERNEL);
+		qp->rq.wrid  = kmalloc(qp->rq.wqe_cnt * sizeof (u64),
+					init_attr->create_flags &
+						IB_QP_CREATE_USE_GFP_NOFS ?
+							GFP_NOFS : GFP_KERNEL);
 
 		if (!qp->sq.wrid || !qp->rq.wrid) {
 			err = -ENOMEM;
@@ -797,7 +809,8 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			goto err_proxy;
 	}
 
-	err = mlx4_qp_alloc(dev->dev, qpn, &qp->mqp);
+	err = mlx4_qp_alloc(dev->dev, qpn, &qp->mqp,
+			init_attr->create_flags & IB_QP_CREATE_USE_GFP_NOFS);
 	if (err)
 		goto err_qpn;
 
@@ -1024,10 +1037,12 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 					MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK |
 					MLX4_IB_SRIOV_TUNNEL_QP |
 					MLX4_IB_SRIOV_SQP |
-					MLX4_IB_QP_NETIF))
+					MLX4_IB_QP_NETIF |
+					IB_QP_CREATE_USE_GFP_NOFS))
 		return ERR_PTR(-EINVAL);
 
-	if (init_attr->create_flags & IB_QP_CREATE_NETIF_QP) {
+	if (init_attr->create_flags & IB_QP_CREATE_NETIF_QP &&
+	    init_attr->create_flags & ~IB_QP_CREATE_USE_GFP_NOFS) {
 		if (init_attr->qp_type != IB_QPT_UD)
 			return ERR_PTR(-EINVAL);
 	}
@@ -1054,7 +1069,10 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 	case IB_QPT_RC:
 	case IB_QPT_UC:
 	case IB_QPT_RAW_PACKET:
-		qp = kzalloc(sizeof *qp, GFP_KERNEL);
+		qp = kzalloc(sizeof *qp,
+				init_attr->create_flags &
+				IB_QP_CREATE_USE_GFP_NOFS ?
+					GFP_NOFS : GFP_KERNEL);
 		if (!qp)
 			return ERR_PTR(-ENOMEM);
 		/* fall through */
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 60c5fb0..17552c0 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -134,13 +134,14 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
 		if (err)
 			goto err_mtt;
 	} else {
-		err = mlx4_db_alloc(dev->dev, &srq->db, 0);
+		err = mlx4_db_alloc(dev->dev, &srq->db, 0, 0);
 		if (err)
 			goto err_srq;
 
 		*srq->db.db = 0;
 
-		if (mlx4_buf_alloc(dev->dev, buf_size, PAGE_SIZE * 2, &srq->buf)) {
+		if (mlx4_buf_alloc(dev->dev, buf_size, PAGE_SIZE * 2,
+					&srq->buf, 0)) {
 			err = -ENOMEM;
 			goto err_db;
 		}
@@ -165,7 +166,7 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
 		if (err)
 			goto err_buf;
 
-		err = mlx4_buf_write_mtt(dev->dev, &srq->mtt, &srq->buf);
+		err = mlx4_buf_write_mtt(dev->dev, &srq->mtt, &srq->buf, 0);
 		if (err)
 			goto err_mtt;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 1377f85..b6dd279 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -48,6 +48,13 @@ MODULE_PARM_DESC(max_nonsrq_conn_qp,
 		 "Max number of connected-mode QPs per interface "
 		 "(applied only if shared receive queue is not available)");
 
+int ipoib_use_gfp_nofs = 0;
+
+module_param_named(use_gfp_nofs, ipoib_use_gfp_nofs, int, 0444);
+MODULE_PARM_DESC(use_gfp_nofs,
+		 "Use GFP_NOFS flags when allocating memory during the TX "
+		 "path for CM.  This should be used when running NFS over IPoIB.");
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
 static int data_debug_level;
 
@@ -1030,7 +1037,9 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
 		.cap.max_send_sge	= 1,
 		.sq_sig_type		= IB_SIGNAL_ALL_WR,
 		.qp_type		= IB_QPT_RC,
-		.qp_context		= tx
+		.qp_context		= tx,
+		.create_flags		= ipoib_use_gfp_nofs ?
+						IB_QP_CREATE_USE_GFP_NOFS : 0
 	};
 
 	return ib_create_qp(priv->pd, &attr);
@@ -1104,12 +1113,15 @@ static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn,
 	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
 	int ret;
 
-	p->tx_ring = vzalloc(ipoib_sendq_size * sizeof *p->tx_ring);
+	p->tx_ring = __vmalloc(ipoib_sendq_size * sizeof *p->tx_ring,
+				ipoib_use_gfp_nofs ? GFP_NOFS : GFP_KERNEL,
+				PAGE_KERNEL);
 	if (!p->tx_ring) {
 		ipoib_warn(priv, "failed to allocate tx ring\n");
 		ret = -ENOMEM;
 		goto err_tx;
 	}
+	memset(p->tx_ring, 0, ipoib_sendq_size * sizeof *p->tx_ring);
 
 	p->qp = ipoib_cm_create_tx_qp(p->dev, p);
 	if (IS_ERR(p->qp)) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/alloc.c b/drivers/net/ethernet/mellanox/mlx4/alloc.c
index c3ad464..60ca7f1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/alloc.c
+++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c
@@ -171,7 +171,7 @@ void mlx4_bitmap_cleanup(struct mlx4_bitmap *bitmap)
  */
 
 int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
-		   struct mlx4_buf *buf)
+		   struct mlx4_buf *buf, int use_gfp_nofs)
 {
 	dma_addr_t t;
 
@@ -180,7 +180,9 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
 		buf->npages       = 1;
 		buf->page_shift   = get_order(size) + PAGE_SHIFT;
 		buf->direct.buf   = dma_alloc_coherent(&dev->pdev->dev,
-						       size, &t, GFP_KERNEL);
+						       size, &t,
+						       use_gfp_nofs ?
+							GFP_NOFS : GFP_KERNEL);
 		if (!buf->direct.buf)
 			return -ENOMEM;
 
@@ -200,14 +202,16 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
 		buf->npages      = buf->nbufs;
 		buf->page_shift  = PAGE_SHIFT;
 		buf->page_list   = kcalloc(buf->nbufs, sizeof(*buf->page_list),
-					   GFP_KERNEL);
+					   use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
 		if (!buf->page_list)
 			return -ENOMEM;
 
 		for (i = 0; i < buf->nbufs; ++i) {
 			buf->page_list[i].buf =
 				dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE,
-						   &t, GFP_KERNEL);
+						   &t,
+						   use_gfp_nofs ?
+							GFP_NOFS : GFP_KERNEL);
 			if (!buf->page_list[i].buf)
 				goto err_free;
 
@@ -218,7 +222,8 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
 
 		if (BITS_PER_LONG == 64) {
 			struct page **pages;
-			pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL);
+			pages = kmalloc(sizeof *pages * buf->nbufs,
+					use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
 			if (!pages)
 				goto err_free;
 			for (i = 0; i < buf->nbufs; ++i)
@@ -260,11 +265,12 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf)
 }
 EXPORT_SYMBOL_GPL(mlx4_buf_free);
 
-static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device)
+static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device,
+						 int use_gfp_nofs)
 {
 	struct mlx4_db_pgdir *pgdir;
 
-	pgdir = kzalloc(sizeof *pgdir, GFP_KERNEL);
+	pgdir = kzalloc(sizeof *pgdir, use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
 	if (!pgdir)
 		return NULL;
 
@@ -272,7 +278,9 @@ static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device)
 	pgdir->bits[0] = pgdir->order0;
 	pgdir->bits[1] = pgdir->order1;
 	pgdir->db_page = dma_alloc_coherent(dma_device, PAGE_SIZE,
-					    &pgdir->db_dma, GFP_KERNEL);
+					    &pgdir->db_dma,
+					    use_gfp_nofs ?
+						GFP_NOFS : GFP_KERNEL);
 	if (!pgdir->db_page) {
 		kfree(pgdir);
 		return NULL;
@@ -312,7 +320,8 @@ found:
 	return 0;
 }
 
-int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order)
+int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order,
+		  int use_gfp_nofs)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_db_pgdir *pgdir;
@@ -324,7 +333,7 @@ int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order)
 		if (!mlx4_alloc_db_from_pgdir(pgdir, db, order))
 			goto out;
 
-	pgdir = mlx4_alloc_db_pgdir(&(dev->pdev->dev));
+	pgdir = mlx4_alloc_db_pgdir(&(dev->pdev->dev), use_gfp_nofs);
 	if (!pgdir) {
 		ret = -ENOMEM;
 		goto out;
@@ -376,13 +385,13 @@ int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
 {
 	int err;
 
-	err = mlx4_db_alloc(dev, &wqres->db, 1);
+	err = mlx4_db_alloc(dev, &wqres->db, 1, 0);
 	if (err)
 		return err;
 
 	*wqres->db.db = 0;
 
-	err = mlx4_buf_alloc(dev, size, max_direct, &wqres->buf);
+	err = mlx4_buf_alloc(dev, size, max_direct, &wqres->buf, 0);
 	if (err)
 		goto err_db;
 
@@ -391,7 +400,7 @@ int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
 	if (err)
 		goto err_buf;
 
-	err = mlx4_buf_write_mtt(dev, &wqres->mtt, &wqres->buf);
+	err = mlx4_buf_write_mtt(dev, &wqres->mtt, &wqres->buf, 0);
 	if (err)
 		goto err_mtt;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index 0487121..9727175 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -173,11 +173,11 @@ int __mlx4_cq_alloc_icm(struct mlx4_dev *dev, int *cqn)
 	if (*cqn == -1)
 		return -ENOMEM;
 
-	err = mlx4_table_get(dev, &cq_table->table, *cqn);
+	err = mlx4_table_get(dev, &cq_table->table, *cqn, 0);
 	if (err)
 		goto err_out;
 
-	err = mlx4_table_get(dev, &cq_table->cmpt_table, *cqn);
+	err = mlx4_table_get(dev, &cq_table->cmpt_table, *cqn, 0);
 	if (err)
 		goto err_put;
 	return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 890922c..ee77284 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -944,7 +944,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv *priv, int qpn,
 	if (!context)
 		return -ENOMEM;
 
-	err = mlx4_qp_alloc(mdev->dev, qpn, qp);
+	err = mlx4_qp_alloc(mdev->dev, qpn, qp, 0);
 	if (err) {
 		en_err(priv, "Failed to allocate qp #%x\n", qpn);
 		goto out;
@@ -984,7 +984,7 @@ int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv)
 		en_err(priv, "Failed reserving drop qpn\n");
 		return err;
 	}
-	err = mlx4_qp_alloc(priv->mdev->dev, qpn, &priv->drop_qp);
+	err = mlx4_qp_alloc(priv->mdev->dev, qpn, &priv->drop_qp, 0);
 	if (err) {
 		en_err(priv, "Failed allocating drop qp\n");
 		mlx4_qp_release_range(priv->mdev->dev, qpn, 1);
@@ -1043,7 +1043,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 	}
 
 	/* Configure RSS indirection qp */
-	err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, &rss_map->indir_qp);
+	err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, &rss_map->indir_qp, 0);
 	if (err) {
 		en_err(priv, "Failed to allocate RSS indirection QP\n");
 		goto rss_err;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 1345703..2f123bf 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -124,7 +124,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 	       ring->buf_size, (unsigned long long) ring->wqres.buf.direct.map);
 
 	ring->qpn = qpn;
-	err = mlx4_qp_alloc(mdev->dev, ring->qpn, &ring->qp);
+	err = mlx4_qp_alloc(mdev->dev, ring->qpn, &ring->qp, 0);
 	if (err) {
 		en_err(priv, "Failed allocating qp %d\n", ring->qpn);
 		goto err_map;
diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 5fbf492..c83b4e6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -245,7 +245,8 @@ int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev)
 			MLX4_CMD_TIME_CLASS_B, MLX4_CMD_NATIVE);
 }
 
-int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj)
+int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj,
+		   int use_gfp_nofs)
 {
 	u32 i = (obj & (table->num_obj - 1)) /
 			(MLX4_TABLE_CHUNK_SIZE / table->obj_size);
@@ -259,7 +260,10 @@ int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj)
 	}
 
 	table->icm[i] = mlx4_alloc_icm(dev, MLX4_TABLE_CHUNK_SIZE >> PAGE_SHIFT,
-				       (table->lowmem ? GFP_KERNEL : GFP_HIGHUSER) |
+				       (table->lowmem ?
+						(use_gfp_nofs ?
+							GFP_NOFS : GFP_KERNEL) :
+					GFP_HIGHUSER) |
 				       __GFP_NOWARN, table->coherent);
 	if (!table->icm[i]) {
 		ret = -ENOMEM;
@@ -356,7 +360,7 @@ int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	u32 i;
 
 	for (i = start; i <= end; i += inc) {
-		err = mlx4_table_get(dev, table, i);
+		err = mlx4_table_get(dev, table, i, 0);
 		if (err)
 			goto fail;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.h b/drivers/net/ethernet/mellanox/mlx4/icm.h
index dee67fa..2be6ac5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.h
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.h
@@ -71,7 +71,8 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
 				gfp_t gfp_mask, int coherent);
 void mlx4_free_icm(struct mlx4_dev *dev, struct mlx4_icm *icm, int coherent);
 
-int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj);
+int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj,
+		   int use_gfp_nofs);
 void mlx4_table_put(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj);
 int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 			 u32 start, u32 end);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index 6b65f77..2d73e12 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -882,7 +882,7 @@ void mlx4_cleanup_cq_table(struct mlx4_dev *dev);
 void mlx4_cleanup_qp_table(struct mlx4_dev *dev);
 void mlx4_cleanup_srq_table(struct mlx4_dev *dev);
 void mlx4_cleanup_mcg_table(struct mlx4_dev *dev);
-int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn);
+int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs);
 void __mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn);
 int __mlx4_cq_alloc_icm(struct mlx4_dev *dev, int *cqn);
 void __mlx4_cq_free_icm(struct mlx4_dev *dev, int cqn);
@@ -890,7 +890,7 @@ int __mlx4_srq_alloc_icm(struct mlx4_dev *dev, int *srqn);
 void __mlx4_srq_free_icm(struct mlx4_dev *dev, int srqn);
 int __mlx4_mpt_reserve(struct mlx4_dev *dev);
 void __mlx4_mpt_release(struct mlx4_dev *dev, u32 index);
-int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index);
+int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs);
 void __mlx4_mpt_free_icm(struct mlx4_dev *dev, u32 index);
 u32 __mlx4_alloc_mtt_range(struct mlx4_dev *dev, int order);
 void __mlx4_free_mtt_range(struct mlx4_dev *dev, u32 first_seg, int order);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c
index 2483585..5fa9371 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mr.c
@@ -364,14 +364,14 @@ static void mlx4_mpt_release(struct mlx4_dev *dev, u32 index)
 	__mlx4_mpt_release(dev, index);
 }
 
-int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
+int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs)
 {
 	struct mlx4_mr_table *mr_table = &mlx4_priv(dev)->mr_table;
 
-	return mlx4_table_get(dev, &mr_table->dmpt_table, index);
+	return mlx4_table_get(dev, &mr_table->dmpt_table, index, use_gfp_nofs);
 }
 
-static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
+static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs)
 {
 	u64 param = 0;
 
@@ -382,7 +382,7 @@ static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
 							MLX4_CMD_TIME_CLASS_A,
 							MLX4_CMD_WRAPPED);
 	}
-	return __mlx4_mpt_alloc_icm(dev, index);
+	return __mlx4_mpt_alloc_icm(dev, index, use_gfp_nofs);
 }
 
 void __mlx4_mpt_free_icm(struct mlx4_dev *dev, u32 index)
@@ -469,7 +469,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr)
 	struct mlx4_mpt_entry *mpt_entry;
 	int err;
 
-	err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mr->key));
+	err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mr->key), 0);
 	if (err)
 		return err;
 
@@ -627,13 +627,14 @@ int mlx4_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
 EXPORT_SYMBOL_GPL(mlx4_write_mtt);
 
 int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
-		       struct mlx4_buf *buf)
+		       struct mlx4_buf *buf, int use_gfp_nofs)
 {
 	u64 *page_list;
 	int err;
 	int i;
 
-	page_list = kmalloc(buf->npages * sizeof *page_list, GFP_KERNEL);
+	page_list = kmalloc(buf->npages * sizeof *page_list,
+			    use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
 	if (!page_list)
 		return -ENOMEM;
 
@@ -680,7 +681,7 @@ int mlx4_mw_enable(struct mlx4_dev *dev, struct mlx4_mw *mw)
 	struct mlx4_mpt_entry *mpt_entry;
 	int err;
 
-	err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mw->key));
+	err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mw->key), 0);
 	if (err)
 		return err;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/qp.c b/drivers/net/ethernet/mellanox/mlx4/qp.c
index 61d64eb..c6db326 100644
--- a/drivers/net/ethernet/mellanox/mlx4/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx4/qp.c
@@ -272,29 +272,29 @@ void mlx4_qp_release_range(struct mlx4_dev *dev, int base_qpn, int cnt)
 }
 EXPORT_SYMBOL_GPL(mlx4_qp_release_range);
 
-int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
+int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_qp_table *qp_table = &priv->qp_table;
 	int err;
 
-	err = mlx4_table_get(dev, &qp_table->qp_table, qpn);
+	err = mlx4_table_get(dev, &qp_table->qp_table, qpn, use_gfp_nofs);
 	if (err)
 		goto err_out;
 
-	err = mlx4_table_get(dev, &qp_table->auxc_table, qpn);
+	err = mlx4_table_get(dev, &qp_table->auxc_table, qpn, use_gfp_nofs);
 	if (err)
 		goto err_put_qp;
 
-	err = mlx4_table_get(dev, &qp_table->altc_table, qpn);
+	err = mlx4_table_get(dev, &qp_table->altc_table, qpn, use_gfp_nofs);
 	if (err)
 		goto err_put_auxc;
 
-	err = mlx4_table_get(dev, &qp_table->rdmarc_table, qpn);
+	err = mlx4_table_get(dev, &qp_table->rdmarc_table, qpn, use_gfp_nofs);
 	if (err)
 		goto err_put_altc;
 
-	err = mlx4_table_get(dev, &qp_table->cmpt_table, qpn);
+	err = mlx4_table_get(dev, &qp_table->cmpt_table, qpn, use_gfp_nofs);
 	if (err)
 		goto err_put_rdmarc;
 
@@ -316,7 +316,7 @@ err_out:
 	return err;
 }
 
-static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
+static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs)
 {
 	u64 param = 0;
 
@@ -326,7 +326,7 @@ static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
 				    MLX4_CMD_ALLOC_RES, MLX4_CMD_TIME_CLASS_A,
 				    MLX4_CMD_WRAPPED);
 	}
-	return __mlx4_qp_alloc_icm(dev, qpn);
+	return __mlx4_qp_alloc_icm(dev, qpn, use_gfp_nofs);
 }
 
 void __mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn)
@@ -355,7 +355,8 @@ static void mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn)
 		__mlx4_qp_free_icm(dev, qpn);
 }
 
-int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp)
+int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp,
+		  int use_gfp_nofs)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_qp_table *qp_table = &priv->qp_table;
@@ -366,7 +367,7 @@ int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp)
 
 	qp->qpn = qpn;
 
-	err = mlx4_qp_alloc_icm(dev, qpn);
+	err = mlx4_qp_alloc_icm(dev, qpn, use_gfp_nofs);
 	if (err)
 		return err;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 57428a0..007434d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -1490,7 +1490,7 @@ static int qp_alloc_res(struct mlx4_dev *dev, int slave, int op, int cmd,
 			return err;
 
 		if (!fw_reserved(dev, qpn)) {
-			err = __mlx4_qp_alloc_icm(dev, qpn);
+			err = __mlx4_qp_alloc_icm(dev, qpn, 0);
 			if (err) {
 				res_abort_move(dev, slave, RES_QP, qpn);
 				return err;
@@ -1577,7 +1577,7 @@ static int mpt_alloc_res(struct mlx4_dev *dev, int slave, int op, int cmd,
 		if (err)
 			return err;
 
-		err = __mlx4_mpt_alloc_icm(dev, mpt->key);
+		err = __mlx4_mpt_alloc_icm(dev, mpt->key, 0);
 		if (err) {
 			res_abort_move(dev, slave, RES_MPT, id);
 			return err;
diff --git a/drivers/net/ethernet/mellanox/mlx4/srq.c b/drivers/net/ethernet/mellanox/mlx4/srq.c
index 98faf87..2cd51a3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/srq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/srq.c
@@ -103,11 +103,11 @@ int __mlx4_srq_alloc_icm(struct mlx4_dev *dev, int *srqn)
 	if (*srqn == -1)
 		return -ENOMEM;
 
-	err = mlx4_table_get(dev, &srq_table->table, *srqn);
+	err = mlx4_table_get(dev, &srq_table->table, *srqn, 0);
 	if (err)
 		goto err_out;
 
-	err = mlx4_table_get(dev, &srq_table->cmpt_table, *srqn);
+	err = mlx4_table_get(dev, &srq_table->cmpt_table, *srqn, 0);
 	if (err)
 		goto err_put;
 	return 0;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 5edd2c6..de2fcf5 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -826,7 +826,7 @@ static inline int mlx4_is_slave(struct mlx4_dev *dev)
 }
 
 int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
-		   struct mlx4_buf *buf);
+		   struct mlx4_buf *buf, int use_gfp_nofs);
 void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf);
 static inline void *mlx4_buf_offset(struct mlx4_buf *buf, int offset)
 {
@@ -863,9 +863,10 @@ int mlx4_mw_enable(struct mlx4_dev *dev, struct mlx4_mw *mw);
 int mlx4_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
 		   int start_index, int npages, u64 *page_list);
 int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
-		       struct mlx4_buf *buf);
+		       struct mlx4_buf *buf, int use_gfp_nofs);
 
-int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order);
+int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order,
+		  int use_gfp_nofs);
 void mlx4_db_free(struct mlx4_dev *dev, struct mlx4_db *db);
 
 int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
@@ -881,7 +882,8 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq);
 int mlx4_qp_reserve_range(struct mlx4_dev *dev, int cnt, int align, int *base);
 void mlx4_qp_release_range(struct mlx4_dev *dev, int base_qpn, int cnt);
 
-int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp);
+int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp,
+		  int use_gfp_nofs);
 void mlx4_qp_free(struct mlx4_dev *dev, struct mlx4_qp *qp);
 
 int mlx4_srq_alloc(struct mlx4_dev *dev, u32 pdn, u32 cqn, u16 xrcdn,

-- 
Jiri Kosina
SUSE Labs


* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
       [not found] ` <CAJZOPZK4Ah+nKPWnX3=yM43jbf586GYJ+fh0-OL4bOnqKK8v8A@mail.gmail.com>
@ 2014-02-25 21:52   ` Or Gerlitz
  2014-02-25 22:11   ` Jiri Kosina
  1 sibling, 0 replies; 22+ messages in thread
From: Or Gerlitz @ 2014-02-25 21:52 UTC
  To: Jiri Kosina
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Or Gerlitz,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel

On Fri, Feb 21, 2014 at 11:53 PM, Jiri Kosina <jkosina@suse.cz> wrote:
>>
>> This was originally a patch from Matthew Finlay <matt@mellanox.com> that
>> addressed a problem whereby NFS writes would enter uninterruptible sleep
>> forever.  The issue happened when using NFS over IPoIB. This is not a
>> recommended configuration as RDMA is preferred but it is still a valid
>> configuration and is important to have in situations where the NFS server
>> does not support RDMA. The problem encountered was described as follows:
>> .



And what happens if you use IPoIB datagram mode? Is the patch needed
there, and if so, why?


Also, the patch uses a new QP creation flag, IB_QP_CREATE_USE_GFP_NOFS,
but it doesn't touch include/rdma/ib_verbs.h, nor do I see this flag
defined anywhere in the patch. Does it compile? How?


* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
       [not found] ` <CAJZOPZK4Ah+nKPWnX3=yM43jbf586GYJ+fh0-OL4bOnqKK8v8A@mail.gmail.com>
  2014-02-25 21:52   ` Or Gerlitz
@ 2014-02-25 22:11   ` Jiri Kosina
  2014-02-25 22:20     ` Or Gerlitz
  2014-03-05 19:46     ` Or Gerlitz
  1 sibling, 2 replies; 22+ messages in thread
From: Jiri Kosina @ 2014-02-25 22:11 UTC
  To: Or Gerlitz
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Or Gerlitz,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel

On Tue, 25 Feb 2014, Or Gerlitz wrote:

> > This was originally a patch from Matthew Finlay <matt@mellanox.com> that
> > addressed a problem whereby NFS writes would enter uninterruptible sleep
> > forever.  The issue happened when using NFS over IPoIB. This is not a
> > recommended configuration as RDMA is preferred but it is still a valid
> > configuration and is important to have in situations where the NFS server
> > does not support RDMA. The problem encountered was described as follows:
> >
> And what happens if you use IPoIB datagram mode? Is the patch needed
> there, and if so, why?

First, thanks a lot for looking into this.

Admittedly I am no InfiniBand expert, but my understanding is that in 
principle the choice between Connected and Datagram mode is about MTU and 
checksum offloading, while the TX path is the same. Please correct me if 
I am wrong.

> Also, the patch uses a new QP creation flag, IB_QP_CREATE_USE_GFP_NOFS, 
> but it doesn't touch include/rdma/ib_verbs.h, nor do I see this flag 
> defined anywhere in the patch. Does it compile? How?

That's my fault: I forgot to 'git add' it, so my tree was building, but 
include/rdma/ib_verbs.h was missing from the git index. Updated patch 
below, sorry for the noise.
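
For readers following the flag plumbing, here is a minimal userspace
sketch of how a create-flags bit such as IB_QP_CREATE_USE_GFP_NOFS is
validated and then turned into an allocation mask. All DEMO_* names, the
bit positions and the gfp values are assumptions made for illustration;
they are not the definitions from include/rdma/ib_verbs.h.

```c
#include <errno.h>

/* Hypothetical create flags; bit positions are placeholders. */
enum demo_qp_create_flags {
	DEMO_QP_CREATE_NETIF_QP     = 1 << 0,
	DEMO_QP_CREATE_USE_GFP_NOFS = 1 << 1,
};

#define DEMO_GFP_KERNEL 0x1u	/* stand-in for GFP_KERNEL */
#define DEMO_GFP_NOFS   0x2u	/* stand-in for GFP_NOFS */

#define DEMO_SUPPORTED_FLAGS \
	(DEMO_QP_CREATE_NETIF_QP | DEMO_QP_CREATE_USE_GFP_NOFS)

/* Reject any flag bit outside the supported set, as a driver's
 * create path typically does before using the flags. */
static int demo_check_create_flags(unsigned int create_flags)
{
	return (create_flags & ~(unsigned int)DEMO_SUPPORTED_FLAGS) ?
		-EINVAL : 0;
}

/* '&' binds tighter than '?:', so the expression below parses as
 * (create_flags & FLAG) ? NOFS : KERNEL, the same form the patch uses. */
static unsigned int demo_pick_gfp(unsigned int create_flags)
{
	return create_flags & DEMO_QP_CREATE_USE_GFP_NOFS ?
		DEMO_GFP_NOFS : DEMO_GFP_KERNEL;
}
```

The point of the sketch is only that a new bit must be added both to the
flag definition and to the driver's set of accepted flags before the
allocation sites can test it.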




From: Jiri Kosina <jkosina@suse.cz>
Subject: [PATCH] mlx4: Use all GFP_NOFS calls during the ipoib TX path when creating the QP

This was a patch from Matthew Finlay <matt@mellanox.com> that addressed a
problem whereby NFS writes would enter uninterruptible sleep forever.  The
issue happened when using NFS over IPoIB. This is not a recommended
configuration as RDMA is preferred but it is still a valid configuration and is
important to have in situations where the NFS server does not support RDMA.
The problem encountered was described as follows:

	It's not memory reclamation that is the problem as such. There is
	an indirect dependency between network filesystems writing back
	pages and ipoib_cm_tx_init() due to how a kworker is used. Page
	reclaim cannot make forward progress until ipoib_cm_tx_init()
	succeeds and it is stuck in page reclaim itself waiting for network
	transmission. Ordinarily this situation may be avoided by having
	the caller use GFP_NOFS but ipoib_cm_tx_init() does not have
	that information.

The patch has been ported to newer kernels by Mel Gorman and later
ported further by Jiri Kosina.
I'd like to get confirmation from Matt that he's fine with having
his Signed-off-by: on this, but he's not responding to my queries.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
---
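
The flag-to-mask selection that the diff below repeats at each allocation
site could be folded into one helper. A minimal userspace sketch, assuming
stand-in bit values and hypothetical sketch_* names (none of these exist in
the kernel; the real masks are defined in include/linux/gfp.h). It relies
only on GFP_NOFS being GFP_KERNEL without the "may call into the
filesystem" bit:

```c
#include <assert.h>

/*
 * Userspace stand-ins, for illustration only. The bit values are
 * hypothetical; the real masks live in include/linux/gfp.h.
 */
typedef unsigned int sketch_gfp_t;

#define SKETCH_GFP_WAIT   0x1u	/* allocation may sleep */
#define SKETCH_GFP_IO     0x2u	/* allocation may start block I/O */
#define SKETCH_GFP_FS     0x4u	/* allocation may call into the filesystem */

#define SKETCH_GFP_KERNEL (SKETCH_GFP_WAIT | SKETCH_GFP_IO | SKETCH_GFP_FS)
#define SKETCH_GFP_NOFS   (SKETCH_GFP_WAIT | SKETCH_GFP_IO)

/* Mirrors the bit position this patch adds to enum ib_qp_create_flags. */
#define SKETCH_QP_CREATE_USE_GFP_NOFS (1u << 2)

/* Hypothetical helper: derive the allocation mask from the QP create flags. */
static inline sketch_gfp_t sketch_qp_gfp_mask(unsigned int create_flags)
{
	sketch_gfp_t gfp = (create_flags & SKETCH_QP_CREATE_USE_GFP_NOFS) ?
				SKETCH_GFP_NOFS : SKETCH_GFP_KERNEL;

	/* Whatever is chosen must never grant more than GFP_KERNEL does. */
	assert((gfp & ~SKETCH_GFP_KERNEL) == 0);
	return gfp;
}
```

With such a helper each call site would compute the mask once and pass it
down, instead of open-coding the ternary at every kmalloc()/kzalloc().
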
 drivers/infiniband/hw/mlx4/cq.c                    |    6 ++--
 drivers/infiniband/hw/mlx4/qp.c                    |   36 +++++++++++++++-----
 drivers/infiniband/hw/mlx4/srq.c                   |    7 ++--
 drivers/infiniband/ulp/ipoib/ipoib_cm.c            |   16 ++++++++-
 drivers/net/ethernet/mellanox/mlx4/alloc.c         |   35 ++++++++++++-------
 drivers/net/ethernet/mellanox/mlx4/cq.c            |    4 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |    6 ++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         |    2 +-
 drivers/net/ethernet/mellanox/mlx4/icm.c           |   10 ++++--
 drivers/net/ethernet/mellanox/mlx4/icm.h           |    3 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h          |    4 +-
 drivers/net/ethernet/mellanox/mlx4/mr.c            |   17 +++++----
 drivers/net/ethernet/mellanox/mlx4/qp.c            |   21 ++++++-----
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |    4 +-
 drivers/net/ethernet/mellanox/mlx4/srq.c           |    4 +-
 include/linux/mlx4/device.h                        |   10 +++--
 include/rdma/ib_verbs.h                            |    1 +
 17 files changed, 118 insertions(+), 68 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index cc40f08..661185a 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -102,7 +102,7 @@ static int mlx4_ib_alloc_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *
 	int err;
 
 	err = mlx4_buf_alloc(dev->dev, nent * dev->dev->caps.cqe_size,
-			     PAGE_SIZE * 2, &buf->buf);
+			     PAGE_SIZE * 2, &buf->buf, 0);
 
 	if (err)
 		goto out;
@@ -113,7 +113,7 @@ static int mlx4_ib_alloc_cq_buf(struct mlx4_ib_dev *dev, struct mlx4_ib_cq_buf *
 	if (err)
 		goto err_buf;
 
-	err = mlx4_buf_write_mtt(dev->dev, &buf->mtt, &buf->buf);
+	err = mlx4_buf_write_mtt(dev->dev, &buf->mtt, &buf->buf, 0);
 	if (err)
 		goto err_mtt;
 
@@ -209,7 +209,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
 
 		uar = &to_mucontext(context)->uar;
 	} else {
-		err = mlx4_db_alloc(dev->dev, &cq->db, 1);
+		err = mlx4_db_alloc(dev->dev, &cq->db, 1, 0);
 		if (err)
 			goto err_cq;
 
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index d8f4d1f..1379ee7 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -744,14 +744,18 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			goto err;
 
 		if (qp_has_rq(init_attr)) {
-			err = mlx4_db_alloc(dev->dev, &qp->db, 0);
+			err = mlx4_db_alloc(dev->dev, &qp->db, 0,
+					init_attr->create_flags &
+						IB_QP_CREATE_USE_GFP_NOFS);
 			if (err)
 				goto err;
 
 			*qp->db.db = 0;
 		}
 
-		if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf)) {
+		if (mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE * 2, &qp->buf,
+					init_attr->create_flags &
+						IB_QP_CREATE_USE_GFP_NOFS)) {
 			err = -ENOMEM;
 			goto err_db;
 		}
@@ -761,12 +765,20 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 		if (err)
 			goto err_buf;
 
-		err = mlx4_buf_write_mtt(dev->dev, &qp->mtt, &qp->buf);
+		err = mlx4_buf_write_mtt(dev->dev, &qp->mtt, &qp->buf,
+				init_attr->create_flags &
+					IB_QP_CREATE_USE_GFP_NOFS);
 		if (err)
 			goto err_mtt;
 
-		qp->sq.wrid  = kmalloc(qp->sq.wqe_cnt * sizeof (u64), GFP_KERNEL);
-		qp->rq.wrid  = kmalloc(qp->rq.wqe_cnt * sizeof (u64), GFP_KERNEL);
+		qp->sq.wrid  = kmalloc(qp->sq.wqe_cnt * sizeof (u64),
+					init_attr->create_flags &
+						IB_QP_CREATE_USE_GFP_NOFS ?
+							GFP_NOFS : GFP_KERNEL);
+		qp->rq.wrid  = kmalloc(qp->rq.wqe_cnt * sizeof (u64),
+					init_attr->create_flags &
+						IB_QP_CREATE_USE_GFP_NOFS ?
+							GFP_NOFS : GFP_KERNEL);
 
 		if (!qp->sq.wrid || !qp->rq.wrid) {
 			err = -ENOMEM;
@@ -797,7 +809,8 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			goto err_proxy;
 	}
 
-	err = mlx4_qp_alloc(dev->dev, qpn, &qp->mqp);
+	err = mlx4_qp_alloc(dev->dev, qpn, &qp->mqp,
+			init_attr->create_flags & IB_QP_CREATE_USE_GFP_NOFS);
 	if (err)
 		goto err_qpn;
 
@@ -1024,10 +1037,12 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 					MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK |
 					MLX4_IB_SRIOV_TUNNEL_QP |
 					MLX4_IB_SRIOV_SQP |
-					MLX4_IB_QP_NETIF))
+					MLX4_IB_QP_NETIF |
+					IB_QP_CREATE_USE_GFP_NOFS))
 		return ERR_PTR(-EINVAL);
 
-	if (init_attr->create_flags & IB_QP_CREATE_NETIF_QP) {
+	if (init_attr->create_flags & IB_QP_CREATE_NETIF_QP &&
+	    init_attr->create_flags & ~IB_QP_CREATE_USE_GFP_NOFS) {
 		if (init_attr->qp_type != IB_QPT_UD)
 			return ERR_PTR(-EINVAL);
 	}
@@ -1054,7 +1069,10 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 	case IB_QPT_RC:
 	case IB_QPT_UC:
 	case IB_QPT_RAW_PACKET:
-		qp = kzalloc(sizeof *qp, GFP_KERNEL);
+		qp = kzalloc(sizeof *qp,
+				init_attr->create_flags &
+				IB_QP_CREATE_USE_GFP_NOFS ?
+					GFP_NOFS : GFP_KERNEL);
 		if (!qp)
 			return ERR_PTR(-ENOMEM);
 		/* fall through */
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 60c5fb0..17552c0 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -134,13 +134,14 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
 		if (err)
 			goto err_mtt;
 	} else {
-		err = mlx4_db_alloc(dev->dev, &srq->db, 0);
+		err = mlx4_db_alloc(dev->dev, &srq->db, 0, 0);
 		if (err)
 			goto err_srq;
 
 		*srq->db.db = 0;
 
-		if (mlx4_buf_alloc(dev->dev, buf_size, PAGE_SIZE * 2, &srq->buf)) {
+		if (mlx4_buf_alloc(dev->dev, buf_size, PAGE_SIZE * 2,
+					&srq->buf, 0)) {
 			err = -ENOMEM;
 			goto err_db;
 		}
@@ -165,7 +166,7 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
 		if (err)
 			goto err_buf;
 
-		err = mlx4_buf_write_mtt(dev->dev, &srq->mtt, &srq->buf);
+		err = mlx4_buf_write_mtt(dev->dev, &srq->mtt, &srq->buf, 0);
 		if (err)
 			goto err_mtt;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 1377f85..b6dd279 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -48,6 +48,13 @@ MODULE_PARM_DESC(max_nonsrq_conn_qp,
 		 "Max number of connected-mode QPs per interface "
 		 "(applied only if shared receive queue is not available)");
 
+int ipoib_use_gfp_nofs = 0;
+
+module_param_named(use_gfp_nofs, ipoib_use_gfp_nofs, int, 0444);
+MODULE_PARM_DESC(use_gfp_nofs,
+		 "Use GFP_NOFS flags when allocating memory during the TX "
+		 "path for CM.  This should be used when running NFS over IPoIB.");
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
 static int data_debug_level;
 
@@ -1030,7 +1037,9 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
 		.cap.max_send_sge	= 1,
 		.sq_sig_type		= IB_SIGNAL_ALL_WR,
 		.qp_type		= IB_QPT_RC,
-		.qp_context		= tx
+		.qp_context		= tx,
+		.create_flags		= ipoib_use_gfp_nofs ?
+						IB_QP_CREATE_USE_GFP_NOFS : 0
 	};
 
 	return ib_create_qp(priv->pd, &attr);
@@ -1104,12 +1113,15 @@ static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn,
 	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
 	int ret;
 
-	p->tx_ring = vzalloc(ipoib_sendq_size * sizeof *p->tx_ring);
+	p->tx_ring = __vmalloc(ipoib_sendq_size * sizeof *p->tx_ring,
+				ipoib_use_gfp_nofs ? GFP_NOFS : GFP_KERNEL,
+				PAGE_KERNEL);
 	if (!p->tx_ring) {
 		ipoib_warn(priv, "failed to allocate tx ring\n");
 		ret = -ENOMEM;
 		goto err_tx;
 	}
+	memset(p->tx_ring, 0, ipoib_sendq_size * sizeof *p->tx_ring);
 
 	p->qp = ipoib_cm_create_tx_qp(p->dev, p);
 	if (IS_ERR(p->qp)) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/alloc.c b/drivers/net/ethernet/mellanox/mlx4/alloc.c
index c3ad464..60ca7f1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/alloc.c
+++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c
@@ -171,7 +171,7 @@ void mlx4_bitmap_cleanup(struct mlx4_bitmap *bitmap)
  */
 
 int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
-		   struct mlx4_buf *buf)
+		   struct mlx4_buf *buf, int use_gfp_nofs)
 {
 	dma_addr_t t;
 
@@ -180,7 +180,9 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
 		buf->npages       = 1;
 		buf->page_shift   = get_order(size) + PAGE_SHIFT;
 		buf->direct.buf   = dma_alloc_coherent(&dev->pdev->dev,
-						       size, &t, GFP_KERNEL);
+						       size, &t,
+						       use_gfp_nofs ?
+							GFP_NOFS : GFP_KERNEL);
 		if (!buf->direct.buf)
 			return -ENOMEM;
 
@@ -200,14 +202,16 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
 		buf->npages      = buf->nbufs;
 		buf->page_shift  = PAGE_SHIFT;
 		buf->page_list   = kcalloc(buf->nbufs, sizeof(*buf->page_list),
-					   GFP_KERNEL);
+					   use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
 		if (!buf->page_list)
 			return -ENOMEM;
 
 		for (i = 0; i < buf->nbufs; ++i) {
 			buf->page_list[i].buf =
 				dma_alloc_coherent(&dev->pdev->dev, PAGE_SIZE,
-						   &t, GFP_KERNEL);
+						   &t,
+						   use_gfp_nofs ?
+							GFP_NOFS : GFP_KERNEL);
 			if (!buf->page_list[i].buf)
 				goto err_free;
 
@@ -218,7 +222,8 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
 
 		if (BITS_PER_LONG == 64) {
 			struct page **pages;
-			pages = kmalloc(sizeof *pages * buf->nbufs, GFP_KERNEL);
+			pages = kmalloc(sizeof *pages * buf->nbufs,
+					use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
 			if (!pages)
 				goto err_free;
 			for (i = 0; i < buf->nbufs; ++i)
@@ -260,11 +265,12 @@ void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf)
 }
 EXPORT_SYMBOL_GPL(mlx4_buf_free);
 
-static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device)
+static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device,
+						 int use_gfp_nofs)
 {
 	struct mlx4_db_pgdir *pgdir;
 
-	pgdir = kzalloc(sizeof *pgdir, GFP_KERNEL);
+	pgdir = kzalloc(sizeof *pgdir, use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
 	if (!pgdir)
 		return NULL;
 
@@ -272,7 +278,9 @@ static struct mlx4_db_pgdir *mlx4_alloc_db_pgdir(struct device *dma_device)
 	pgdir->bits[0] = pgdir->order0;
 	pgdir->bits[1] = pgdir->order1;
 	pgdir->db_page = dma_alloc_coherent(dma_device, PAGE_SIZE,
-					    &pgdir->db_dma, GFP_KERNEL);
+					    &pgdir->db_dma,
+					    use_gfp_nofs ?
+						GFP_NOFS : GFP_KERNEL);
 	if (!pgdir->db_page) {
 		kfree(pgdir);
 		return NULL;
@@ -312,7 +320,8 @@ found:
 	return 0;
 }
 
-int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order)
+int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order,
+		  int use_gfp_nofs)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_db_pgdir *pgdir;
@@ -324,7 +333,7 @@ int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order)
 		if (!mlx4_alloc_db_from_pgdir(pgdir, db, order))
 			goto out;
 
-	pgdir = mlx4_alloc_db_pgdir(&(dev->pdev->dev));
+	pgdir = mlx4_alloc_db_pgdir(&(dev->pdev->dev), use_gfp_nofs);
 	if (!pgdir) {
 		ret = -ENOMEM;
 		goto out;
@@ -376,13 +385,13 @@ int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
 {
 	int err;
 
-	err = mlx4_db_alloc(dev, &wqres->db, 1);
+	err = mlx4_db_alloc(dev, &wqres->db, 1, 0);
 	if (err)
 		return err;
 
 	*wqres->db.db = 0;
 
-	err = mlx4_buf_alloc(dev, size, max_direct, &wqres->buf);
+	err = mlx4_buf_alloc(dev, size, max_direct, &wqres->buf, 0);
 	if (err)
 		goto err_db;
 
@@ -391,7 +400,7 @@ int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
 	if (err)
 		goto err_buf;
 
-	err = mlx4_buf_write_mtt(dev, &wqres->mtt, &wqres->buf);
+	err = mlx4_buf_write_mtt(dev, &wqres->mtt, &wqres->buf, 0);
 	if (err)
 		goto err_mtt;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index 0487121..9727175 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -173,11 +173,11 @@ int __mlx4_cq_alloc_icm(struct mlx4_dev *dev, int *cqn)
 	if (*cqn == -1)
 		return -ENOMEM;
 
-	err = mlx4_table_get(dev, &cq_table->table, *cqn);
+	err = mlx4_table_get(dev, &cq_table->table, *cqn, 0);
 	if (err)
 		goto err_out;
 
-	err = mlx4_table_get(dev, &cq_table->cmpt_table, *cqn);
+	err = mlx4_table_get(dev, &cq_table->cmpt_table, *cqn, 0);
 	if (err)
 		goto err_put;
 	return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 890922c..ee77284 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -944,7 +944,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv *priv, int qpn,
 	if (!context)
 		return -ENOMEM;
 
-	err = mlx4_qp_alloc(mdev->dev, qpn, qp);
+	err = mlx4_qp_alloc(mdev->dev, qpn, qp, 0);
 	if (err) {
 		en_err(priv, "Failed to allocate qp #%x\n", qpn);
 		goto out;
@@ -984,7 +984,7 @@ int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv)
 		en_err(priv, "Failed reserving drop qpn\n");
 		return err;
 	}
-	err = mlx4_qp_alloc(priv->mdev->dev, qpn, &priv->drop_qp);
+	err = mlx4_qp_alloc(priv->mdev->dev, qpn, &priv->drop_qp, 0);
 	if (err) {
 		en_err(priv, "Failed allocating drop qp\n");
 		mlx4_qp_release_range(priv->mdev->dev, qpn, 1);
@@ -1043,7 +1043,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 	}
 
 	/* Configure RSS indirection qp */
-	err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, &rss_map->indir_qp);
+	err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, &rss_map->indir_qp, 0);
 	if (err) {
 		en_err(priv, "Failed to allocate RSS indirection QP\n");
 		goto rss_err;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 1345703..2f123bf 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -124,7 +124,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 	       ring->buf_size, (unsigned long long) ring->wqres.buf.direct.map);
 
 	ring->qpn = qpn;
-	err = mlx4_qp_alloc(mdev->dev, ring->qpn, &ring->qp);
+	err = mlx4_qp_alloc(mdev->dev, ring->qpn, &ring->qp, 0);
 	if (err) {
 		en_err(priv, "Failed allocating qp %d\n", ring->qpn);
 		goto err_map;
diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 5fbf492..c83b4e6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -245,7 +245,8 @@ int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev)
 			MLX4_CMD_TIME_CLASS_B, MLX4_CMD_NATIVE);
 }
 
-int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj)
+int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj,
+		   int use_gfp_nofs)
 {
 	u32 i = (obj & (table->num_obj - 1)) /
 			(MLX4_TABLE_CHUNK_SIZE / table->obj_size);
@@ -259,7 +260,10 @@ int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj)
 	}
 
 	table->icm[i] = mlx4_alloc_icm(dev, MLX4_TABLE_CHUNK_SIZE >> PAGE_SHIFT,
-				       (table->lowmem ? GFP_KERNEL : GFP_HIGHUSER) |
+				       (table->lowmem ?
+						(use_gfp_nofs ?
+							GFP_NOFS : GFP_KERNEL) :
+					GFP_HIGHUSER) |
 				       __GFP_NOWARN, table->coherent);
 	if (!table->icm[i]) {
 		ret = -ENOMEM;
@@ -356,7 +360,7 @@ int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	u32 i;
 
 	for (i = start; i <= end; i += inc) {
-		err = mlx4_table_get(dev, table, i);
+		err = mlx4_table_get(dev, table, i, 0);
 		if (err)
 			goto fail;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.h b/drivers/net/ethernet/mellanox/mlx4/icm.h
index dee67fa..2be6ac5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.h
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.h
@@ -71,7 +71,8 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
 				gfp_t gfp_mask, int coherent);
 void mlx4_free_icm(struct mlx4_dev *dev, struct mlx4_icm *icm, int coherent);
 
-int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj);
+int mlx4_table_get(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj,
+		   int use_gfp_nofs);
 void mlx4_table_put(struct mlx4_dev *dev, struct mlx4_icm_table *table, u32 obj);
 int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 			 u32 start, u32 end);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index 6b65f77..2d73e12 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -882,7 +882,7 @@ void mlx4_cleanup_cq_table(struct mlx4_dev *dev);
 void mlx4_cleanup_qp_table(struct mlx4_dev *dev);
 void mlx4_cleanup_srq_table(struct mlx4_dev *dev);
 void mlx4_cleanup_mcg_table(struct mlx4_dev *dev);
-int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn);
+int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs);
 void __mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn);
 int __mlx4_cq_alloc_icm(struct mlx4_dev *dev, int *cqn);
 void __mlx4_cq_free_icm(struct mlx4_dev *dev, int cqn);
@@ -890,7 +890,7 @@ int __mlx4_srq_alloc_icm(struct mlx4_dev *dev, int *srqn);
 void __mlx4_srq_free_icm(struct mlx4_dev *dev, int srqn);
 int __mlx4_mpt_reserve(struct mlx4_dev *dev);
 void __mlx4_mpt_release(struct mlx4_dev *dev, u32 index);
-int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index);
+int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs);
 void __mlx4_mpt_free_icm(struct mlx4_dev *dev, u32 index);
 u32 __mlx4_alloc_mtt_range(struct mlx4_dev *dev, int order);
 void __mlx4_free_mtt_range(struct mlx4_dev *dev, u32 first_seg, int order);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c
index 2483585..5fa9371 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mr.c
@@ -364,14 +364,14 @@ static void mlx4_mpt_release(struct mlx4_dev *dev, u32 index)
 	__mlx4_mpt_release(dev, index);
 }
 
-int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
+int __mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs)
 {
 	struct mlx4_mr_table *mr_table = &mlx4_priv(dev)->mr_table;
 
-	return mlx4_table_get(dev, &mr_table->dmpt_table, index);
+	return mlx4_table_get(dev, &mr_table->dmpt_table, index, use_gfp_nofs);
 }
 
-static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
+static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index, int use_gfp_nofs)
 {
 	u64 param = 0;
 
@@ -382,7 +382,7 @@ static int mlx4_mpt_alloc_icm(struct mlx4_dev *dev, u32 index)
 							MLX4_CMD_TIME_CLASS_A,
 							MLX4_CMD_WRAPPED);
 	}
-	return __mlx4_mpt_alloc_icm(dev, index);
+	return __mlx4_mpt_alloc_icm(dev, index, use_gfp_nofs);
 }
 
 void __mlx4_mpt_free_icm(struct mlx4_dev *dev, u32 index)
@@ -469,7 +469,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr)
 	struct mlx4_mpt_entry *mpt_entry;
 	int err;
 
-	err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mr->key));
+	err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mr->key), 0);
 	if (err)
 		return err;
 
@@ -627,13 +627,14 @@ int mlx4_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
 EXPORT_SYMBOL_GPL(mlx4_write_mtt);
 
 int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
-		       struct mlx4_buf *buf)
+		       struct mlx4_buf *buf, int use_gfp_nofs)
 {
 	u64 *page_list;
 	int err;
 	int i;
 
-	page_list = kmalloc(buf->npages * sizeof *page_list, GFP_KERNEL);
+	page_list = kmalloc(buf->npages * sizeof *page_list,
+			    use_gfp_nofs ? GFP_NOFS : GFP_KERNEL);
 	if (!page_list)
 		return -ENOMEM;
 
@@ -680,7 +681,7 @@ int mlx4_mw_enable(struct mlx4_dev *dev, struct mlx4_mw *mw)
 	struct mlx4_mpt_entry *mpt_entry;
 	int err;
 
-	err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mw->key));
+	err = mlx4_mpt_alloc_icm(dev, key_to_hw_index(mw->key), 0);
 	if (err)
 		return err;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/qp.c b/drivers/net/ethernet/mellanox/mlx4/qp.c
index 61d64eb..c6db326 100644
--- a/drivers/net/ethernet/mellanox/mlx4/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx4/qp.c
@@ -272,29 +272,29 @@ void mlx4_qp_release_range(struct mlx4_dev *dev, int base_qpn, int cnt)
 }
 EXPORT_SYMBOL_GPL(mlx4_qp_release_range);
 
-int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
+int __mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_qp_table *qp_table = &priv->qp_table;
 	int err;
 
-	err = mlx4_table_get(dev, &qp_table->qp_table, qpn);
+	err = mlx4_table_get(dev, &qp_table->qp_table, qpn, use_gfp_nofs);
 	if (err)
 		goto err_out;
 
-	err = mlx4_table_get(dev, &qp_table->auxc_table, qpn);
+	err = mlx4_table_get(dev, &qp_table->auxc_table, qpn, use_gfp_nofs);
 	if (err)
 		goto err_put_qp;
 
-	err = mlx4_table_get(dev, &qp_table->altc_table, qpn);
+	err = mlx4_table_get(dev, &qp_table->altc_table, qpn, use_gfp_nofs);
 	if (err)
 		goto err_put_auxc;
 
-	err = mlx4_table_get(dev, &qp_table->rdmarc_table, qpn);
+	err = mlx4_table_get(dev, &qp_table->rdmarc_table, qpn, use_gfp_nofs);
 	if (err)
 		goto err_put_altc;
 
-	err = mlx4_table_get(dev, &qp_table->cmpt_table, qpn);
+	err = mlx4_table_get(dev, &qp_table->cmpt_table, qpn, use_gfp_nofs);
 	if (err)
 		goto err_put_rdmarc;
 
@@ -316,7 +316,7 @@ err_out:
 	return err;
 }
 
-static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
+static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn, int use_gfp_nofs)
 {
 	u64 param = 0;
 
@@ -326,7 +326,7 @@ static int mlx4_qp_alloc_icm(struct mlx4_dev *dev, int qpn)
 				    MLX4_CMD_ALLOC_RES, MLX4_CMD_TIME_CLASS_A,
 				    MLX4_CMD_WRAPPED);
 	}
-	return __mlx4_qp_alloc_icm(dev, qpn);
+	return __mlx4_qp_alloc_icm(dev, qpn, use_gfp_nofs);
 }
 
 void __mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn)
@@ -355,7 +355,8 @@ static void mlx4_qp_free_icm(struct mlx4_dev *dev, int qpn)
 		__mlx4_qp_free_icm(dev, qpn);
 }
 
-int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp)
+int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp,
+		  int use_gfp_nofs)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_qp_table *qp_table = &priv->qp_table;
@@ -366,7 +367,7 @@ int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp)
 
 	qp->qpn = qpn;
 
-	err = mlx4_qp_alloc_icm(dev, qpn);
+	err = mlx4_qp_alloc_icm(dev, qpn, use_gfp_nofs);
 	if (err)
 		return err;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 57428a0..007434d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -1490,7 +1490,7 @@ static int qp_alloc_res(struct mlx4_dev *dev, int slave, int op, int cmd,
 			return err;
 
 		if (!fw_reserved(dev, qpn)) {
-			err = __mlx4_qp_alloc_icm(dev, qpn);
+			err = __mlx4_qp_alloc_icm(dev, qpn, 0);
 			if (err) {
 				res_abort_move(dev, slave, RES_QP, qpn);
 				return err;
@@ -1577,7 +1577,7 @@ static int mpt_alloc_res(struct mlx4_dev *dev, int slave, int op, int cmd,
 		if (err)
 			return err;
 
-		err = __mlx4_mpt_alloc_icm(dev, mpt->key);
+		err = __mlx4_mpt_alloc_icm(dev, mpt->key, 0);
 		if (err) {
 			res_abort_move(dev, slave, RES_MPT, id);
 			return err;
diff --git a/drivers/net/ethernet/mellanox/mlx4/srq.c b/drivers/net/ethernet/mellanox/mlx4/srq.c
index 98faf87..2cd51a3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/srq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/srq.c
@@ -103,11 +103,11 @@ int __mlx4_srq_alloc_icm(struct mlx4_dev *dev, int *srqn)
 	if (*srqn == -1)
 		return -ENOMEM;
 
-	err = mlx4_table_get(dev, &srq_table->table, *srqn);
+	err = mlx4_table_get(dev, &srq_table->table, *srqn, 0);
 	if (err)
 		goto err_out;
 
-	err = mlx4_table_get(dev, &srq_table->cmpt_table, *srqn);
+	err = mlx4_table_get(dev, &srq_table->cmpt_table, *srqn, 0);
 	if (err)
 		goto err_put;
 	return 0;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 5edd2c6..de2fcf5 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -826,7 +826,7 @@ static inline int mlx4_is_slave(struct mlx4_dev *dev)
 }
 
 int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
-		   struct mlx4_buf *buf);
+		   struct mlx4_buf *buf, int use_gfp_nofs);
 void mlx4_buf_free(struct mlx4_dev *dev, int size, struct mlx4_buf *buf);
 static inline void *mlx4_buf_offset(struct mlx4_buf *buf, int offset)
 {
@@ -863,9 +863,10 @@ int mlx4_mw_enable(struct mlx4_dev *dev, struct mlx4_mw *mw);
 int mlx4_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
 		   int start_index, int npages, u64 *page_list);
 int mlx4_buf_write_mtt(struct mlx4_dev *dev, struct mlx4_mtt *mtt,
-		       struct mlx4_buf *buf);
+		       struct mlx4_buf *buf, int use_gfp_nofs);
 
-int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order);
+int mlx4_db_alloc(struct mlx4_dev *dev, struct mlx4_db *db, int order,
+		  int use_gfp_nofs);
 void mlx4_db_free(struct mlx4_dev *dev, struct mlx4_db *db);
 
 int mlx4_alloc_hwq_res(struct mlx4_dev *dev, struct mlx4_hwq_resources *wqres,
@@ -881,7 +882,8 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq);
 int mlx4_qp_reserve_range(struct mlx4_dev *dev, int cnt, int align, int *base);
 void mlx4_qp_release_range(struct mlx4_dev *dev, int base_qpn, int cnt);
 
-int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp);
+int mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp,
+		  int use_gfp_nofs);
 void mlx4_qp_free(struct mlx4_dev *dev, struct mlx4_qp *qp);
 
 int mlx4_srq_alloc(struct mlx4_dev *dev, u32 pdn, u32 cqn, u16 xrcdn,
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 6793f32..f39001c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -643,6 +643,7 @@ enum ib_qp_type {
 enum ib_qp_create_flags {
 	IB_QP_CREATE_IPOIB_UD_LSO		= 1 << 0,
 	IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK	= 1 << 1,
+	IB_QP_CREATE_USE_GFP_NOFS		= 1 << 2,
 	IB_QP_CREATE_NETIF_QP			= 1 << 5,
 	/* reserve bits 26-31 for low level drivers' internal use */
 	IB_QP_CREATE_RESERVED_START		= 1 << 26,
-- 
Jiri Kosina
SUSE Labs

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-02-25 22:11   ` Jiri Kosina
@ 2014-02-25 22:20     ` Or Gerlitz
  2014-02-25 22:40       ` Jiri Kosina
  2014-03-05 19:46     ` Or Gerlitz
  1 sibling, 1 reply; 22+ messages in thread
From: Or Gerlitz @ 2014-02-25 22:20 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Or Gerlitz,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel

On Wed, Feb 26, 2014 at 12:11 AM, Jiri Kosina <jkosina@suse.cz> wrote:
> On Tue, 25 Feb 2014, Or Gerlitz wrote:

>> And what happens if you use IPoIB datagram mode, is/why the patch is
>> needed there?

> I admittedly am no infiniband expert, but my understanding is that in
> principle Connected/Datagram mode is about MTU and checksum offloading,

yes, the differences between the modes relate to these aspects, however

> but the TX path is the same. Please correct me if I am wrong.

no, note that your patch only touched drivers/infiniband/ulp/ipoib/ipoib_cm.c
which is basically compiled out unless you set CONFIG_INFINIBAND_IPOIB_CM,
so surely the TX paths for the datagram vs. connected modes are
different.
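
To illustrate what "compiled out" means here, a minimal userspace sketch,
assuming a hypothetical SKETCH_CONFIG_IPOIB_CM macro in place of the real
Kconfig symbol and hypothetical sketch_* names (this is not the ipoib
code). When the option is off, the connected-mode check collapses to a
stub that always says "no", so only the datagram path can be taken:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_dev {
	bool cm_admin_enabled;	/* hypothetical: admin selected connected mode */
};

#ifdef SKETCH_CONFIG_IPOIB_CM
/* Connected-mode support is built in: honour the runtime setting. */
static inline bool sketch_cm_enabled(const struct sketch_dev *dev)
{
	return dev->cm_admin_enabled;
}
#else
/* Connected-mode support is compiled out: the stub always says "no". */
static inline bool sketch_cm_enabled(const struct sketch_dev *dev)
{
	(void)dev;
	return false;
}
#endif

enum sketch_tx_path { SKETCH_TX_DATAGRAM, SKETCH_TX_CONNECTED };

static inline enum sketch_tx_path
sketch_pick_tx_path(const struct sketch_dev *dev)
{
	assert(dev != NULL);
	return sketch_cm_enabled(dev) ? SKETCH_TX_CONNECTED : SKETCH_TX_DATAGRAM;
}
```

Building with -DSKETCH_CONFIG_IPOIB_CM switches the first variant in; without
it, sketch_pick_tx_path() can only ever return the datagram path.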

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-02-25 22:20     ` Or Gerlitz
@ 2014-02-25 22:40       ` Jiri Kosina
  2014-02-25 22:48         ` Or Gerlitz
  0 siblings, 1 reply; 22+ messages in thread
From: Jiri Kosina @ 2014-02-25 22:40 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Or Gerlitz,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel

On Wed, 26 Feb 2014, Or Gerlitz wrote:

> >> And what happens if you use IPoIB datagram mode, is/why the patch is
> >> needed there?
> 
> > I admittedly am no infiniband expert, but my understanding is that in
> > principle Connected/Datagram mode is about MTU and checksum offloading,
> 
> yes, the differences between the modes relate to these aspects, however

Thanks for confirming.

> > but the TX path is the same. Please correct me if I am wrong.
> 
> no, note that your patch only touched drivers/infiniband/ulp/ipoib/ipoib_cm.c
> which is basically compiled out unless you set CONFIG_INFINIBAND_IPOIB_CM,
> so surely the TX paths for the datagram vs. connected modes are
> different.

Yes, but for datagram mode, the tx_ring is allocated in a completely 
different way (not from kworker), so this might be a non-issue, right? I 
will have to look into it more deeply to be really sure; if you can 
provide your insight, that'd be helpful.

-- 
Jiri Kosina
SUSE Labs

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-02-25 22:40       ` Jiri Kosina
@ 2014-02-25 22:48         ` Or Gerlitz
  2014-02-25 22:55           ` Jiri Kosina
  0 siblings, 1 reply; 22+ messages in thread
From: Or Gerlitz @ 2014-02-25 22:48 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Or Gerlitz,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel

On Wed, Feb 26, 2014 at 12:40 AM, Jiri Kosina <jkosina@suse.cz> wrote:
> On Wed, 26 Feb 2014, Or Gerlitz wrote:
>
>>>> And what happens if you use IPoIB datagram mode, is/why the patch is
>>>> needed there?

>>> I admittedly am no infiniband expert, but my understanding is that in
>>> principle Connected/Datagram mode is about MTU and checksum
>>> offloading


>> yes, the differences between the modes relate to these aspects, however

> Thanks for confirming

Still, even if different, I still don't see why not use datagram mode
if the problem hits you only for connected mode. E.g. datagram mode
supports LSO/GRO and TX/RX checksum offloads which should make up for the
smaller MTU vs. connected mode.


>> > but the TX path is the same. Please correct me if I am wrong.

>> no, note that your patch only touched drivers/infiniband/ulp/ipoib/ipoib_cm.c
>> which is basically compiled out unless you set CONFIG_INFINIBAND_IPOIB_CM,
>> so surely the TX paths for the datagram vs. connected modes are
>> different.

> Yes, but for datagram mode, the tx_ring is allocated in a completely
> different way (not from kworker), so this might be a non-issue, right? I
> will have to look into it more deeply to be really sure; if you can
> provide your insight, that'd be helpful.


Note that even when operating in connected mode, the ipoib net-device
instance can speak in datagram mode with remote nodes who don't
support connected mode and/or when sending multicast -- specifically
ipoib_dev_init() does the setup of the TX ring. Maybe you can just try
this out and see if it works?
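
The per-destination choice described above can be sketched in userspace as
follows. This is a minimal illustration under stated assumptions: all
sketch_* names and the two-field destination struct are hypothetical and
do not correspond to the actual ipoib data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one destination. */
struct sketch_dest {
	bool is_multicast;	/* multicast traffic always uses datagram */
	bool peer_supports_cm;	/* remote node can speak connected mode */
};

enum sketch_path { SKETCH_PATH_UD, SKETCH_PATH_CM };

/*
 * Even on a device running in connected mode, the datagram (UD) path is
 * still used for multicast and for peers without connected-mode support.
 */
static inline enum sketch_path
sketch_path_for(bool dev_in_connected_mode, const struct sketch_dest *dest)
{
	assert(dest != NULL);

	if (!dev_in_connected_mode)
		return SKETCH_PATH_UD;
	if (dest->is_multicast || !dest->peer_supports_cm)
		return SKETCH_PATH_UD;
	return SKETCH_PATH_CM;
}
```

So a connected-mode interface exercises both TX paths, which is why testing
the datagram side on the same setup is cheap.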

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-02-25 22:48         ` Or Gerlitz
@ 2014-02-25 22:55           ` Jiri Kosina
  0 siblings, 0 replies; 22+ messages in thread
From: Jiri Kosina @ 2014-02-25 22:55 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Or Gerlitz,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel

On Wed, 26 Feb 2014, Or Gerlitz wrote:

> Still, even if different, I still don't see why not use datagram mode
> if the problem hits you only for connected mode. E.g. datagram mode
> supports LSO/GRO and TX/RX checksum offloads which should make up for the
> smaller MTU vs. connected mode.
[ ... snip ... ]
> > Yes, but for datagram mode, the tx_ring is allocated in a completely
> > different way (not from kworker), so this might be a non-issue, right? I
> > will have to look into it more deeply to be really sure; if you can
> > provide your insight, that'd be helpful.
> 
> Note that even when operating in connected mode, the ipoib net-device
> instance can speak in datagram mode with remote nodes that don't
> support connected mode and/or when sending multicast -- specifically
> ipoib_dev_init() does the setup of the TX ring. Maybe you can just try
> this out and see if it works?

That definitely can be verified, and I am putting it on my TODO list.

But let's make sure that we don't diverge from the original problem too 
much. Simple fact is that the deadlock is there when using connected mode, 
and there is nothing preventing users from using it this way, therefore I 
believe it should be fixed one way or another.

If the problem is still there in datagram mode (which, as far as my 
understanding of the code goes, is not the case), it should be fixed as 
well, but that's a different story.

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-02-25 22:11   ` Jiri Kosina
  2014-02-25 22:20     ` Or Gerlitz
@ 2014-03-05 19:46     ` Or Gerlitz
  1 sibling, 0 replies; 22+ messages in thread
From: Or Gerlitz @ 2014-03-05 19:46 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Or Gerlitz,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel

On Wed, Feb 26, 2014 at 12:11 AM, Jiri Kosina <jkosina@suse.cz> wrote:
> The problem encountered was described as follows:
>         It's not memory reclamation that is the problem as such. There is
>         an indirect dependency between network filesystems writing back
>         pages and ipoib_cm_tx_init() due to how a kworker is used. Page
>         reclaim cannot make forward progress until ipoib_cm_tx_init()
>         succeeds and it is stuck in page reclaim itself waiting for network
>         transmission. Ordinarily this situation may be avoided by having
>         the caller use GFP_NOFS but ipoib_cm_tx_init() does not have
>         that information.

So to hit the bug, one just needs to attempt an NFS client mount
over an IP subnet served by an IPoIB NIC that uses connected mode and
runs over an mlx4 device?  Or does this happen when the connection is
going through teardown/re-establishment, or something else?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-02-21 21:53 [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP Jiri Kosina
       [not found] ` <CAJZOPZK4Ah+nKPWnX3=yM43jbf586GYJ+fh0-OL4bOnqKK8v8A@mail.gmail.com>
@ 2014-03-06 13:31 ` Or Gerlitz
  2014-03-06 13:47   ` Jiri Kosina
  1 sibling, 1 reply; 22+ messages in thread
From: Or Gerlitz @ 2014-03-06 13:31 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Eugenia Emantayev,
	David S. Miller, Mel Gorman, netdev, linux-kernel,
	Saeed Mahameed, Sagi Grimberg, Shlomo Pongratz

On 21/02/2014 23:53, Jiri Kosina wrote:
> This was originally a patch from Matthew Finlay<matt@mellanox.com>  that
> addressed a problem whereby NFS writes would enter uninterruptible sleep
> forever.  The issue happened when using NFS over IPoIB. This is not a
> recommended configuration as RDMA is preferred but it is still a valid
> configuration and is important to have in situations where the NFS server
> does not support RDMA. The problem encountered was described as follows:
>
> 	It's not memory reclamation that is the problem as such. There is
> 	an indirect dependency between network filesystems writing back
> 	pages and ipoib_cm_tx_init() due to how a kworker is used. Page
> 	reclaim cannot make forward progress until ipoib_cm_tx_init()
> 	succeeds and it is stuck in page reclaim itself waiting for network
> 	transmission. Ordinarily this situation may be avoided by having
> 	the caller use GFP_NOFS but ipoib_cm_tx_init() does not have
> 	that information.
>

Hi Jiri,

Reading the problem description again (*), the team here would be happy 
to clarify some details with you (possibly a few MM newbie questions, 
but it will help us):

1. just to make sure, the problem happens on the NFS client, not the NFS 
server, right? so writing-back means the client writing over the NFS 
mount --> network

2. you wrote "due to how a kworker is used", can you clarify if/why 
things go wrong b/c of the kworker usage, or is this a matter of phrasing?

in an earlier post in this thread you wrote "There was a problem with 
swapping over NFS, as writeback was deadlocked with memory reclaim 
(memory needs to be allocated so that swap could be accessed to 
reclaim memory). That's fixed by allocating the buffers from PF_MEMALLOC 
reserve, introduced by Mel's and Peter's patchset back in 3.9 or so. Oh, 
and the same has been done for swapping over NBD, btw", in that respect:

3. you mentioned that the memory allocations in ipoib_cm_tx_init() and 
ib_create_qp() --> mlx4 driver require page reclaim and wait for 
network transmission, so does this client node put its swap on that 
NFS partition?

4. Can you shed more light on why the problem also hits kmalloc-based 
allocations and not only vmalloc-based allocations, e.g. not only b/c 
of the vzalloc call in ipoib_cm_tx_init() but also b/c of misc kmalloc 
calls within the HW (here mlx4) driver?

thanks,

Or.

(*) and sorry for my stupid question from yesterday, sometimes it's a bad 
idea to ask questions on mailing lists when you are very tired

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-03-06 13:31 ` Or Gerlitz
@ 2014-03-06 13:47   ` Jiri Kosina
  0 siblings, 0 replies; 22+ messages in thread
From: Jiri Kosina @ 2014-03-06 13:47 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Eugenia Emantayev,
	David S. Miller, Mel Gorman, netdev, linux-kernel,
	Saeed Mahameed, Sagi Grimberg, Shlomo Pongratz

On Thu, 6 Mar 2014, Or Gerlitz wrote:

> > This was originally a patch from Matthew Finlay<matt@mellanox.com>  that
> > addressed a problem whereby NFS writes would enter uninterruptible sleep
> > forever.  The issue happened when using NFS over IPoIB. This is not a
> > recommended configuration as RDMA is preferred but it is still a valid
> > configuration and is important to have in situations where the NFS server
> > does not support RDMA. The problem encountered was described as follows:
> > 
> > 	It's not memory reclamation that is the problem as such. There is
> > 	an indirect dependency between network filesystems writing back
> > 	pages and ipoib_cm_tx_init() due to how a kworker is used. Page
> > 	reclaim cannot make forward progress until ipoib_cm_tx_init()
> > 	succeeds and it is stuck in page reclaim itself waiting for network
> > 	transmission. Ordinarily this situation may be avoided by having
> > 	the caller use GFP_NOFS but ipoib_cm_tx_init() does not have
> > 	that information.
> > 
> 
> Hi Jiri,
>
> Reading again (*) the problem description, the team here would be happy 
> to clarify with you some details (possibly few MM newbie questions, but 
> it will help us):

Hi Or,

thanks for getting back to me. I am sure there are better people to ask 
MM-related questions, but here we go.

Oh, and by the way, the very original version of the patch is coming from 
a Mellanox employee Matthew Finlay, so perhaps it might be much more 
efficient if you would be able to contact him and discuss the details with 
him.

> 1. just to make sure, the problem happens on the NFS client, not the NFS 
> server, right? so writing-back means the client writing over the NFS 
> mount --> network

Yes, that is the case.

> 2. you wrote "due to how a kworker is used", can you clarify if/why things go
> wrong b/c of the kworker usage, or this is matter of phrasing?

The mlx kworker trying to allocate memory with GFP_KERNEL will eventually 
get stuck; if the system is under memory pressure, performing memory 
reclaim is needed in order to free occupied memory and use it for the 
GFP_KERNEL allocation.

Writeback can't however proceed, as the mlx kworker is stuck waiting 
exactly on the writeback to eventually happen.

> in an earlier post in this thread you wrote "There was a problem with swapping
> over NFS, as writeback was deadlocked with memory reclaim (memory needs to be
> allocated so that swap could be accessed to reclaim memory). That's fixed by
> allocating the buffers from PF_MEMALLOC reserve, introduced by Mel's and
> Peter's patchset back in 3.9 or so. Oh, and the same has been done for
> swapping over NBD, btw", in that respect:
>
> 3. you mentioned that the memory allocations in ipoib_cm_tx_init() and 
> ib_create_qp() --> mlx4 driver requires page reclaim and waits for 
> network transmission, so this client node put their swap over that NFS 
> partition?

They need memory reclaim to happen in low-memory situations. GFP_KERNEL 
allocation is allowed to go to sleep and wait for the reclaim to succeed.

> 4. Can you shed more light, why the problem hits also for kmalloc based 
> allocations and not only for vmalloc based allocation e.g not only b/c 
> of the vzalloc call in ipoib_cm_tx_init but rather also b/c of misc 
> kmalloc calls within the HW (here mlx4) driver?

GFP_KERNEL is the key here -- an allocation using GFP_KERNEL is allowed 
to sleep until memory reclamation has succeeded.
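
The dependency described in the answers above can be modelled as a small wait-for chain. This is only a userspace sketch, not kernel code: the actor names, build_wait_graph() and is_deadlocked() are hypothetical, and the boolean parameter is a simplification that treats GFP_KERNEL as "the allocation may end up waiting for writeback" and a restricted mask as "it may not".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical actors in the reported hang; the names are illustrative only. */
enum actor {
	KWORKER_QP_SETUP,	/* kworker running ipoib_cm_tx_init() */
	PAGE_RECLAIM,		/* direct reclaim entered by the allocation */
	NFS_WRITEBACK,		/* writeback of dirty NFS pages */
	IPOIB_TX,		/* transmission over the IPoIB-CM connection */
	NR_ACTORS,
	NOBODY			/* marker: not blocked on anyone */
};

/*
 * waits_on[a] is the one actor that a is blocked on, or NOBODY.
 * alloc_waits_for_writeback == true models a GFP_KERNEL allocation in the
 * kworker; false models a restricted mask (an assumption of this sketch).
 */
void build_wait_graph(enum actor waits_on[NR_ACTORS],
		      bool alloc_waits_for_writeback)
{
	waits_on[KWORKER_QP_SETUP] =
		alloc_waits_for_writeback ? PAGE_RECLAIM : NOBODY;
	waits_on[PAGE_RECLAIM] = NFS_WRITEBACK;	/* pages must be written back */
	waits_on[NFS_WRITEBACK] = IPOIB_TX;	/* writeback needs the network */
	waits_on[IPOIB_TX] = KWORKER_QP_SETUP;	/* TX needs the QP to exist */
}

/* Follow the chain from start; reaching start again means a cycle. */
bool is_deadlocked(const enum actor waits_on[NR_ACTORS], enum actor start)
{
	enum actor cur = waits_on[start];

	for (int steps = 0; steps < NR_ACTORS && cur != NOBODY; steps++) {
		if (cur == start)
			return true;
		cur = waits_on[cur];
	}
	return false;
}
```

With the GFP_KERNEL-like setting the chain closes on itself (kworker, reclaim, writeback, TX, kworker); with the restricted setting the kworker is not blocked and the chain is open.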

Thanks again,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-04-24 20:01             ` Or Gerlitz
@ 2014-05-02 13:03               ` Jiri Kosina
  0 siblings, 0 replies; 22+ messages in thread
From: Jiri Kosina @ 2014-05-02 13:03 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, Roland Dreier, Sagi Grimberg, Amir Vadai, Eli Cohen,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel, Saeed Mahameed, Shlomo Pongratz

On Thu, 24 Apr 2014, Or Gerlitz wrote:

> I sent you private note on Mar 19th saying "are you on track with this
> for 3.15? the merge window is coming soon and you have 99% of what you
> need -- lets get there" which seems to be the piece that fell between
> the cracks.

I probably failed to see the implication that you/Mellanox is expecting me 
to deliver the final patch, sorry for that.

> > probably because it wasn't clearly stated *who* will be preparing 
> > patch(es) that'd be implementing the ideas above. My understanding was 
> > that it'd be Mellanox, given that they basically own the driver,
> 
> to be precise partially own the mlx4 HW driver (Roland is the
> mlx4_core/ib author and maintainer, we are asking for few years to get
> co-maintainer hat there, no success so far)

I am of course completely unfamiliar with internal affairs such as this, 
so it's not really appropriate for me to comment, but if a hardware vendor 
is not able to obtain maintainership of the driver for its own HW, that 
seems to signal some serious issues.

> , the problem you have described can happen with any HW driver, but as 
> stated earlier, the suggested plan will fix ipoib and mlx4 and following 
> that more hw drivers can be enhanced to support that too.
> 
> >
> > have the best testing coverage compared to very very limited
> > testing coverage I can do, and will be pushing it upstream anyway.
> >
> > I believe all the above can be easily created on top of the original patch I sent.
> 
> Indeed, so will take a look next week and let you know if it works for me

Excellent, thanks a lot, looking forward to any updates on this.

Thanks again,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-04-24 17:03           ` Jiri Kosina
@ 2014-04-24 20:01             ` Or Gerlitz
  2014-05-02 13:03               ` Jiri Kosina
  0 siblings, 1 reply; 22+ messages in thread
From: Or Gerlitz @ 2014-04-24 20:01 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Or Gerlitz, Roland Dreier, Sagi Grimberg, Amir Vadai, Eli Cohen,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel, Saeed Mahameed, Shlomo Pongratz

On Thu, Apr 24, 2014 at 8:03 PM, Jiri Kosina <jkosina@suse.cz> wrote:
>
> On Tue, 11 Mar 2014, Or Gerlitz wrote:


[...]
>
> > So sounds like a plan that makes sense?
>
> Hi everybody, seems like this fell through cracks,


Hi Jiri,

I sent you private note on Mar 19th saying "are you on track with this
for 3.15? the merge window is coming soon and you have 99% of what you
need -- lets get there" which seems to be the piece that fell between
the cracks.

>
> probably because it wasn't clearly stated *who* will be preparing
> patch(es) that'd be implementing the ideas above. My understanding was
> that it'd be Mellanox, given that they basically own the driver,



to be precise, we partially own the mlx4 HW driver (Roland is the
mlx4_core/ib author and maintainer, we have been asking for a few years
to get a co-maintainer hat there, no success so far). The problem you
have described can happen with any HW driver, but as stated earlier, the
suggested plan will fix ipoib and mlx4, and following that more hw
drivers can be enhanced to support that too.


>
> have the best testing coverage compared to very very limited
> testing coverage I can do, and will be pushing it upstream anyway.
>
> I believe all the above can be easily created on top of the original
> patch I sent.



Indeed, so I will take a look next week and let you know if it works for me


>
>
> So ... was there a misunderstanding on who is going to do it? :)
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-03-11 13:53         ` Or Gerlitz
  2014-03-14 19:50           ` Jiri Kosina
@ 2014-04-24 17:03           ` Jiri Kosina
  2014-04-24 20:01             ` Or Gerlitz
  1 sibling, 1 reply; 22+ messages in thread
From: Jiri Kosina @ 2014-04-24 17:03 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, Sagi Grimberg, Amir Vadai, Eli Cohen,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel, Saeed Mahameed, Shlomo Pongratz

On Tue, 11 Mar 2014, Or Gerlitz wrote:

> > mode on any IB device.  In connected mode, a packet send can trigger
> > establishing a new connection, which will allocate a new QP, which in
> > particular will allocate memory for the QP in the low-level IB device
> > driver.  Currently I'm positive that every driver will do GFP_KERNEL
> > allocations when allocating a QP (ehca does both a GFP_KERNEL
> > kmem_cache allocation and vmalloc in internal_create_qp(), mlx5 and
> > mthca are similar to mlx4 and qib does vmalloc() in qib_create_qp()).
> > So this patch needs to be extended to the other 4 IB device drivers in
> > the tree.
> > 
> > Also, I don't think GFP_NOFS is enough -- it seems we need GFP_NOIO,
> > since we could be swapping to a block device over iSCSI over IPoIB-CM,
> > so even non-FS stuff could deadlock.
> > 
> > I don't think it makes any sense to have a "do_not_deadlock" module
> > parameter, especially one that defaults to "false."  If this is the
> > right thing to do, then we should just unconditionally do it.
> > 
> > It does seem that only using GFP_NOIO when we really need to would be
> > a very difficult problem--how can we carry information about whether a
> > particular packet is involved in freeing memory through all the layers
> > of, say, NFS, TCP, IPSEC, bonding, &c?
> 
> Agree with all the above... next,
> 
> If we don't have a way to nicely overcome the layer violations here, let's
> change IPoIB so that it always asks the IB driver to allocate QPs used for
> Connected Mode in a GFP_NOIO manner. To be practical, I suggest the
> following:
> 
> 1. Add a new QP creation flag IB_QP_CREATE_USE_GFP to the existing creation
> flags of struct ib_qp_init_attr, and a new "gfp_t gfp" field to that
> structure too
> 
> 2. in the IPoIB CM code, do the vzalloc allocation for a new connection in
> a GFP_NOIO manner and issue the call to create the QP with the
> IB_QP_CREATE_USE_GFP flag set and GFP_NOIO in the gfp field
> 
> 3. If the QP creation fails, with -EINVAL, issue a warning and retry the QP
> creation attempt without the GFP setting
> 
> 4. implement in the mlx4 driver the support for GFP directives on QP creation
> 
> 5. for the rest of the IB drivers, return -EINVAL if IB_QP_CREATE_USE_GFP is
> set
> 
> This will allow to provide working solution for mlx4 users and gradually add
> support for the rest of the IB drivers.
> 
> as for proper patch planning
> 
> patch #1 / items 1 and 5
> patch #2 / item 4
> patch #3 / item 3
> 
> Re item 5 -- I made a check and the ehca, ipath and mthca driver already
> return -EINVAL if provided with any creation flag, so you only need to patch
> the qib driver in qib_create_qp() to do that as well which is trivial.
> 
> As for the rest of the code, you practically have it all by now, just need to
> port the mlx4 changes you did to the suggested framework, remove the module
> param (which you don't like either) and add the new gfp_t field to
> ib_qp_init_attr
> 
> So sounds like a plan that makes sense?

Hi everybody,

seems like this fell through cracks, probably because it wasn't clearly 
stated *who* will be preparing patch(es) that'd be implementing the ideas 
above.

My understanding was that it'd be Mellanox, given that they basically own 
the driver, have the best testing coverage compared to very very limited 
testing coverage I can do, and will be pushing it upstream anyway. 

I believe all the above can be easily created on top of the original patch 
I sent.

So ... was there a misunderstanding on who is going to do it? :)

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-03-11 13:53         ` Or Gerlitz
@ 2014-03-14 19:50           ` Jiri Kosina
  2014-04-24 17:03           ` Jiri Kosina
  1 sibling, 0 replies; 22+ messages in thread
From: Jiri Kosina @ 2014-03-14 19:50 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, Sagi Grimberg, Amir Vadai, Eli Cohen,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel, Saeed Mahameed, Shlomo Pongratz

On Tue, 11 Mar 2014, Or Gerlitz wrote:

> > It does seem that only using GFP_NOIO when we really need to would be
> > a very difficult problem--how can we carry information about whether a
> > particular packet is involved in freeing memory through all the layers
> > of, say, NFS, TCP, IPSEC, bonding, &c?
> 
> Agree with all the above... next,
> 
> If we don't have a way to nicely overcome the layer violations here, 
> let's change IPoIB so that it always asks the IB driver to allocate QPs 
> used for Connected Mode in a GFP_NOIO manner. To be practical, I suggest 
> the following:
> 
> 1. Add a new QP creation flag IB_QP_CREATE_USE_GFP to the existing 
> creation flags of struct ib_qp_init_attr, and a new "gfp_t gfp" field to 
> that structure too
> 
> 2. in the IPoIB CM code, do the vzalloc allocation for a new connection in 
> a GFP_NOIO manner and issue the call to create the QP with the 
> IB_QP_CREATE_USE_GFP flag set and GFP_NOIO in the gfp field
> 
> 3. If the QP creation fails, with -EINVAL, issue a warning and retry the 
> QP creation attempt without the GFP setting
> 
> 4. implement in the mlx4 driver the support for GFP directives on QP 
> creation

1-4 make perfect sense to me.

> 5. for the rest of the IB drivers, return -EINVAL if 
> IB_QP_CREATE_USE_GFP is set

Umm, why this?

> This will allow to provide working solution for mlx4 users and gradually 
> add support for the rest of the IB drivers.

Oh, I see, so that's just a temporary measure.

> as for proper patch planning
> 
> patch #1 / items 1 and 5
> patch #2 / item 4
> patch #3 / item 3
> 
> Re item 5 -- I made a check and the ehca, ipath and mthca driver already
> return -EINVAL if provided with any creation flag, so you only need to patch
> the qib driver in qib_create_qp() to do that as well which is trivial.
> 
> As for the rest of the code, you practically have it all by now, just need to
> port the mlx4 changes you did to the suggested framework, remove the module
> param (which you don't like either) and add the new gfp_t field to
> ib_qp_init_attr
> 
> So sounds like a plan that makes sense?

Sounds fine by me.

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-03-05 19:25       ` Roland Dreier
@ 2014-03-11 13:53         ` Or Gerlitz
  2014-03-14 19:50           ` Jiri Kosina
  2014-04-24 17:03           ` Jiri Kosina
  0 siblings, 2 replies; 22+ messages in thread
From: Or Gerlitz @ 2014-03-11 13:53 UTC (permalink / raw)
  To: Roland Dreier, Jiri Kosina
  Cc: Sagi Grimberg, Amir Vadai, Eli Cohen, Eugenia Emantayev,
	David S. Miller, Mel Gorman, netdev, linux-kernel,
	Saeed Mahameed, Shlomo Pongratz

On 05/03/2014 21:25, Roland Dreier wrote:
> It's quite clear that this is a general problem with IPoIB connected
> mode on any IB device.  In connected mode, a packet send can trigger
> establishing a new connection, which will allocate a new QP, which in
> particular will allocate memory for the QP in the low-level IB device
> driver.  Currently I'm positive that every driver will do GFP_KERNEL
> allocations when allocating a QP (ehca does both a GFP_KERNEL
> kmem_cache allocation and vmalloc in internal_create_qp(), mlx5 and
> mthca are similar to mlx4 and qib does vmalloc() in qib_create_qp()).
> So this patch needs to be extended to the other 4 IB device drivers in
> the tree.
>
> Also, I don't think GFP_NOFS is enough -- it seems we need GFP_NOIO,
> since we could be swapping to a block device over iSCSI over IPoIB-CM,
> so even non-FS stuff could deadlock.
>
> I don't think it makes any sense to have a "do_not_deadlock" module
> parameter, especially one that defaults to "false."  If this is the
> right thing to do, then we should just unconditionally do it.
>
> It does seem that only using GFP_NOIO when we really need to would be
> a very difficult problem--how can we carry information about whether a
> particular packet is involved in freeing memory through all the layers
> of, say, NFS, TCP, IPSEC, bonding, &c?

Agree with all the above... next,

If we don't have a way to nicely overcome the layer violations here, 
let's change IPoIB so that it always asks the IB driver to allocate QPs 
used for Connected Mode in a GFP_NOIO manner. To be practical, I suggest 
the following:

1. Add a new QP creation flag IB_QP_CREATE_USE_GFP to the existing 
creation flags of struct ib_qp_init_attr, and a new "gfp_t gfp" field to 
that structure too

2. in the IPoIB CM code, do the vzalloc allocation for a new connection in 
a GFP_NOIO manner and issue the call to create the QP with the 
IB_QP_CREATE_USE_GFP flag set and GFP_NOIO in the gfp field

3. If the QP creation fails, with -EINVAL, issue a warning and retry the 
QP creation attempt without the GFP setting

4. implement in the mlx4 driver the support for GFP directives on QP 
creation

5. for the rest of the IB drivers, return -EINVAL if 
IB_QP_CREATE_USE_GFP is set

This will provide a working solution for mlx4 users and allow support 
to be added gradually for the rest of the IB drivers.

as for proper patch planning

patch #1 / items 1 and 5
patch #2 / item 4
patch #3 / item 3

Re item 5 -- I checked, and the ehca, ipath and mthca drivers already 
return -EINVAL if provided with any creation flag, so you only need to 
patch the qib driver in qib_create_qp() to do that as well, which is trivial.

As for the rest of the code, you practically have it all by now, just 
need to port the mlx4 changes you did to the suggested framework, remove 
the module param (which you don't like either) and add the new gfp_t 
field to ib_qp_init_attr

So sounds like a plan that makes sense?

Or.
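
Items 1 to 5 of the plan above could look roughly like the following userspace model. It is a sketch under stated assumptions, not the eventual kernel patch: every sketch_* name is hypothetical, SKETCH_QP_CREATE_USE_GFP stands in for the proposed IB_QP_CREATE_USE_GFP flag, the gfp member stands in for the proposed "gfp_t gfp" field, and the numeric mask values are placeholders rather than the kernel's.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

typedef unsigned int sketch_gfp_t;
#define SKETCH_GFP_KERNEL 0x7u	/* placeholder values, not the kernel's */
#define SKETCH_GFP_NOIO   0x1u

#define SKETCH_QP_CREATE_USE_GFP 0x1u	/* item 1: new creation flag */

struct sketch_qp_init_attr {
	unsigned int create_flags;
	sketch_gfp_t gfp;		/* item 1: new gfp field */
};

struct sketch_qp {
	sketch_gfp_t gfp_used;		/* mask the "driver" allocated with */
};

typedef int (*sketch_create_qp_fn)(const struct sketch_qp_init_attr *attr,
				   struct sketch_qp *qp);

/* Item 4: a driver that honours the GFP directive. */
int sketch_create_qp_gfp_aware(const struct sketch_qp_init_attr *attr,
			       struct sketch_qp *qp)
{
	qp->gfp_used = (attr->create_flags & SKETCH_QP_CREATE_USE_GFP) ?
		       attr->gfp : SKETCH_GFP_KERNEL;
	return 0;
}

/* Item 5: a driver that rejects any creation flag it does not know. */
int sketch_create_qp_legacy(const struct sketch_qp_init_attr *attr,
			    struct sketch_qp *qp)
{
	if (attr->create_flags)
		return -EINVAL;
	qp->gfp_used = SKETCH_GFP_KERNEL;
	return 0;
}

/* Items 2 and 3: ask for GFP_NOIO first, warn and retry on -EINVAL. */
int sketch_cm_create_tx_qp(sketch_create_qp_fn create_qp, struct sketch_qp *qp)
{
	struct sketch_qp_init_attr attr = {
		.create_flags = SKETCH_QP_CREATE_USE_GFP,
		.gfp = SKETCH_GFP_NOIO,
	};
	int ret = create_qp(&attr, qp);

	if (ret == -EINVAL) {
		fprintf(stderr, "driver lacks GFP support, retrying without\n");
		attr.create_flags &= ~SKETCH_QP_CREATE_USE_GFP;
		attr.gfp = SKETCH_GFP_KERNEL;
		ret = create_qp(&attr, qp);
	}
	return ret;
}
```

Calling sketch_cm_create_tx_qp() with the GFP-aware driver leaves gfp_used at the NOIO placeholder; with the legacy driver the first attempt fails with -EINVAL and the retry falls back to the GFP_KERNEL placeholder, which is the temporary behaviour item 3 describes.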


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-02-27 10:42     ` Jiri Kosina
  2014-03-04 22:48       ` Jiri Kosina
@ 2014-03-05 19:25       ` Roland Dreier
  2014-03-11 13:53         ` Or Gerlitz
  1 sibling, 1 reply; 22+ messages in thread
From: Roland Dreier @ 2014-03-05 19:25 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Or Gerlitz, Or Gerlitz, Amir Vadai, Eli Cohen, Eugenia Emantayev,
	David S. Miller, Mel Gorman, netdev, linux-kernel

On Thu, Feb 27, 2014 at 2:42 AM, Jiri Kosina <jkosina@suse.cz> wrote:
> Whatever suits you best. To sum it up:
>
> - mlx4 is confirmed to have this problem, and we know how that problem
>   happens -- see the paragraph in the changelog explaining the dependency
>   between memory reclaim and allocation of TX ring
>
> - we have a work around which requires human interaction in order
>   to provide the information whether GFP_NOFS should be used or not
>
> - I can very well understand why Mellanox would see that as a hack, but if
>   more comprehensive fix is necessary, I'd expect those who understand
>   the code the best to come up with a solution/proposal. I'd assume that
>   you don't  want to keep the code with known and easily triggerable
>   deadlock out there unfixed.
>
> - where I see the potential for layering violation in any 'general'
>   solution is that it's the filesystem that has to be "talking" to the
>   underlying netdevice, i.e. you'll have to make filesystem
>   netdevice-aware, right?

It's quite clear that this is a general problem with IPoIB connected
mode on any IB device.  In connected mode, a packet send can trigger
establishing a new connection, which will allocate a new QP, which in
particular will allocate memory for the QP in the low-level IB device
driver.  Currently I'm positive that every driver will do GFP_KERNEL
allocations when allocating a QP (ehca does both a GFP_KERNEL
kmem_cache allocation and vmalloc in internal_create_qp(), mlx5 and
mthca are similar to mlx4 and qib does vmalloc() in qib_create_qp()).
So this patch needs to be extended to the other 4 IB device drivers in
the tree.

Also, I don't think GFP_NOFS is enough -- it seems we need GFP_NOIO,
since we could be swapping to a block device over iSCSI over IPoIB-CM,
so even non-FS stuff could deadlock.

I don't think it makes any sense to have a "do_not_deadlock" module
parameter, especially one that defaults to "false."  If this is the
right thing to do, then we should just unconditionally do it.

It does seem that only using GFP_NOIO when we really need to would be
a very difficult problem--how can we carry information about whether a
particular packet is involved in freeing memory through all the layers
of, say, NFS, TCP, IPSEC, bonding, &c?
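
The GFP_NOFS vs. GFP_NOIO point above can be illustrated with a small userspace model of how the masks are composed. The SKETCH_* bit values are made up for illustration (the real definitions are in include/linux/gfp.h), and can_recurse_into_ipoib() is a hypothetical helper that simplifies reclaim to two cases: filesystem writeback of dirty NFS pages, and swap-out to a block device reached over iSCSI.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int sketch_gfp_t;

/* Illustrative bits only; not the kernel's numeric values. */
#define SKETCH_GFP_WAIT 0x1u	/* allocation may sleep and enter reclaim */
#define SKETCH_GFP_IO   0x2u	/* reclaim may start block I/O (swap-out) */
#define SKETCH_GFP_FS   0x4u	/* reclaim may call into filesystem code */

/* Composition mirrors the kernel masks of that era. */
#define SKETCH_GFP_NOIO   (SKETCH_GFP_WAIT)
#define SKETCH_GFP_NOFS   (SKETCH_GFP_WAIT | SKETCH_GFP_IO)
#define SKETCH_GFP_KERNEL (SKETCH_GFP_WAIT | SKETCH_GFP_IO | SKETCH_GFP_FS)

bool may_enter_fs(sketch_gfp_t mask)
{
	return (mask & SKETCH_GFP_FS) != 0;
}

bool may_start_io(sketch_gfp_t mask)
{
	return (mask & SKETCH_GFP_IO) != 0;
}

/*
 * True if an allocation made with this mask while setting up an IPoIB-CM
 * connection could end up waiting on I/O that needs that same connection.
 */
bool can_recurse_into_ipoib(sketch_gfp_t mask, bool swap_on_iscsi,
			    bool dirty_nfs_pages)
{
	if (dirty_nfs_pages && may_enter_fs(mask))
		return true;	/* NFS writeback over IPoIB */
	if (swap_on_iscsi && may_start_io(mask))
		return true;	/* swap-out over iSCSI over IPoIB */
	return false;
}
```

In this model GFP_NOFS closes the NFS case but still allows the swap-over-iSCSI case, while GFP_NOIO closes both, which is the reason given above for preferring GFP_NOIO.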

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-03-04 22:48       ` Jiri Kosina
@ 2014-03-05 15:57         ` Or Gerlitz
  0 siblings, 0 replies; 22+ messages in thread
From: Or Gerlitz @ 2014-03-05 15:57 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Or Gerlitz, Roland Dreier, Amir Vadai, Eli Cohen,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel, Saeed Mahameed

On 05/03/2014 00:48, Jiri Kosina wrote:
> On Thu, 27 Feb 2014, Jiri Kosina wrote:
>
>> On Thu, 27 Feb 2014, Or Gerlitz wrote:
>>
>>> ipoib is coded over the verbs API (include/rdma/ib_verbs.h)  --- so tracking
>>> the path from ipoib through the verbs api into mlx4 should be similar exercise
>>> as doing so for mlx5, but let's 1st treat the higher level elements involved
>>> with this patch.
>>>
>>> Can you shed some light why the problem happens only for NFS, and not for
>>> example with other IP/TCP storage protocols?
>>>
>>> For example, do you expect it to happen with iSCSI/TCP too? the Linux
>>> iSCSI initiator 1st open a TCP socket from user space to the target,
>>> next they do login exchange over this socket and later provide the
>>> socket to the kernel iscsi code to use as the back-end of a SCSI block
>>> device registered with the SCSI midlayer
>> Frankly, no idea. There was a problem with swapping over NFS, as writeback
>> was deadlocked with memory reclaim (memory needs to be allocated so that
>> swap could be accessed to reclaim memory). That's fixed by allocating the
>> buffers from PF_MEMALLOC reserve, introduced by Mel's and Peter's patchset
>> back in 3.9 or so. Oh, and the same has been done for swapping over NBD,
>> btw. Maybe iSCSI needs similar treatment, maybe it has it already, I
>> haven't checked. We haven't seen a bugreport for that though.
>>
>>>> I don't think we have, and it indeed should be rather easy to add. The
>>>> more challenging part of the problem is where (and based on which
>>>> data) the flag would actually be set up on the netdevice so that it's
>>>> not horrible layering violation.
>>> I assume that in the same manner netdevices advertize features to the
>>> networking core, the core can provide them operating directives after
>>> they register themselves.
>> Whatever suits you best. To sum it up:
>>
>> - mlx4 is confirmed to have this problem, and we know how that problem
>>    happens -- see the paragraph in the changelog explaining the dependency
>>    between memory reclaim and allocation of TX ring
>>
>> - we have a work around which requires human interaction in order
>>    to provide the information whether GFP_NOFS should be used or not
>>
>> - I can very well understand why Mellanox would see that as a hack, but if
>>    more comprehensive fix is necessary, I'd expect those who understand
>>    the code the best to come up with a solution/proposal. I'd assume that
>>    you don't  want to keep the code with known and easily triggerable
>>    deadlock out there unfixed.
>>
>> - where I see the potential for layering violation in any 'general'
>>    solution is that it's the filesystem that has to be "talking" to the
>>    underlying netdevice, i.e. you'll have to make filesystem
>>    netdevice-aware, right?
> Mellanox folks, do you have any plan how to proceed here please?
>

Hi Jiri,

Yep, we will look into that. I think we still have a few directions to 
resolve here

1. (our task) deeper understanding of the problem

2. if the solution goes in the way you took it, look for

2.1 a more generic verbs interface, e.g. a QP creation flag that dictates 
the GFP_YYY to use when allocating memory for that QP, e.g. NOIO, NOFS, 
ATOMIC, etc

2.2 a more programmable interface for the file-system to let the NIC 
know it is under constraint YYY for its memory allocations, maybe 
per neighbour? maybe use netdevice private flags?

Or.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-02-27 10:42     ` Jiri Kosina
@ 2014-03-04 22:48       ` Jiri Kosina
  2014-03-05 15:57         ` Or Gerlitz
  2014-03-05 19:25       ` Roland Dreier
  1 sibling, 1 reply; 22+ messages in thread
From: Jiri Kosina @ 2014-03-04 22:48 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, Roland Dreier, Amir Vadai, Eli Cohen,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel

On Thu, 27 Feb 2014, Jiri Kosina wrote:

> On Thu, 27 Feb 2014, Or Gerlitz wrote:
> 
> > ipoib is coded over the verbs API (include/rdma/ib_verbs.h)  --- so tracking
> > the path from ipoib through the verbs api into mlx4 should be similar exercise
> > as doing so for mlx5, but let's 1st treat the higher level elements involved
> > with this patch.
> > 
> > Can you shed some light why the problem happens only for NFS, and not for
> > example with other IP/TCP storage protocols?
> >
> > For example, do you expect it to happen with iSCSI/TCP too? the Linux 
> > iSCSI initiator 1st open a TCP socket from user space to the target, 
> > next they do login exchange over this socket and later provide the 
> > socket to the kernel iscsi code to use as the back-end of a SCSI block 
> > device registered with the SCSI midlayer
> 
> Frankly, no idea. There was a problem with swapping over NFS, as writeback 
> was deadlocked with memory reclaim (memory needs to be allocated so that 
> swap could be accessed to reclaim memory). That's fixed by allocating the 
> buffers from PF_MEMALLOC reserve, introduced by Mel's and Peter's patchset 
> back in 3.9 or so. Oh, and the same has been done for swapping over NBD, 
> btw. Maybe iSCSI needs similar treatment, maybe it has it already, I 
> haven't checked. We haven't seen a bugreport for that though.
> 
> > > I don't think we have, and it indeed should be rather easy to add. The 
> > > more challenging part of the problem is where (and based on which 
> > > data) the flag would actually be set up on the netdevice so that it's 
> > > not horrible layering violation.
> > 
> > I assume that in the same manner netdevices advertize features to the 
> > networking core, the core can provide them operating directives after 
> > they register themselves.
> 
> Whatever suits you best. To sum it up:
> 
> - mlx4 is confirmed to have this problem, and we know how that problem 
>   happens -- see the paragraph in the changelog explaining the dependency 
>   between memory reclaim and allocation of TX ring
> 
> - we have a work around which requires human interaction in order 
>   to provide the information whether GFP_NOFS should be used or not
> 
> - I can very well understand why Mellanox would see that as a hack, but if 
>   more comprehensive fix is necessary, I'd expect those who understand 
>   the code the best to come up with a solution/proposal. I'd assume that 
>   you don't  want to keep the code with known and easily triggerable 
>   deadlock out there unfixed.
> 
> - where I see the potential for layering violation in any 'general' 
>   solution is that it's the filesystem that has to be "talking" to the 
>   underlying netdevice, i.e. you'll have to make filesystem 
>   netdevice-aware, right?

Mellanox folks, do you have any plan how to proceed here please?

Thanks,

-- 
Jiri Kosina
SUSE Labs


* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-02-27  9:58   ` Or Gerlitz
@ 2014-02-27 10:42     ` Jiri Kosina
  2014-03-04 22:48       ` Jiri Kosina
  2014-03-05 19:25       ` Roland Dreier
  0 siblings, 2 replies; 22+ messages in thread
From: Jiri Kosina @ 2014-02-27 10:42 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, Roland Dreier, Amir Vadai, Eli Cohen,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel

On Thu, 27 Feb 2014, Or Gerlitz wrote:

> ipoib is coded over the verbs API (include/rdma/ib_verbs.h)  --- so tracking
> the path from ipoib through the verbs api into mlx4 should be similar exercise
> as doing so for mlx5, but let's 1st treat the higher level elements involved
> with this patch.
> 
> Can you shed some light why the problem happens only for NFS, and not for
> example with other IP/TCP storage protocols?
>
> For example, do you expect it to happen with iSCSI/TCP too? the Linux 
> iSCSI initiator 1st open a TCP socket from user space to the target, 
> next they do login exchange over this socket and later provide the 
> socket to the kernel iscsi code to use as the back-end of a SCSI block 
> device registered with the SCSI midlayer

Frankly, no idea. There was a problem with swapping over NFS, as writeback 
was deadlocked with memory reclaim (memory needs to be allocated so that 
swap could be accessed to reclaim memory). That's fixed by allocating the 
buffers from PF_MEMALLOC reserve, introduced by Mel's and Peter's patchset 
back in 3.9 or so. Oh, and the same has been done for swapping over NBD, 
btw. Maybe iSCSI needs similar treatment, maybe it has it already, I 
haven't checked. We haven't seen a bug report for that though.

> > I don't think we have, and it indeed should be rather easy to add. The 
> > more challenging part of the problem is where (and based on which 
> > data) the flag would actually be set up on the netdevice so that it's 
> > not horrible layering violation.
> 
> I assume that in the same manner netdevices advertize features to the 
> networking core, the core can provide them operating directives after 
> they register themselves.

Whatever suits you best. To sum it up:

- mlx4 is confirmed to have this problem, and we know how that problem 
  happens -- see the paragraph in the changelog explaining the dependency 
  between memory reclaim and allocation of TX ring

- we have a workaround which requires human interaction in order 
  to provide the information whether GFP_NOFS should be used or not

- I can very well understand why Mellanox would see that as a hack, but if 
  a more comprehensive fix is necessary, I'd expect those who understand 
  the code the best to come up with a solution/proposal. I'd assume that 
  you don't want to keep the code with a known and easily triggerable 
  deadlock out there unfixed.

- where I see the potential for a layering violation in any 'general' 
  solution is that it's the filesystem that has to be "talking" to the 
  underlying netdevice, i.e. you'll have to make the filesystem 
  netdevice-aware, right?

Thanks,

-- 
Jiri Kosina
SUSE Labs


* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-02-27  9:48 ` Jiri Kosina
@ 2014-02-27  9:58   ` Or Gerlitz
  2014-02-27 10:42     ` Jiri Kosina
  0 siblings, 1 reply; 22+ messages in thread
From: Or Gerlitz @ 2014-02-27  9:58 UTC (permalink / raw)
  To: Jiri Kosina, Or Gerlitz
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Eugenia Emantayev,
	David S. Miller, Mel Gorman, netdev, linux-kernel

On 27/02/2014 11:48, Jiri Kosina wrote:
> On Wed, 26 Feb 2014, Or Gerlitz wrote:
>
>>> But let's make sure that we don't diverge from the original problem too
>>> much. Simple fact is that the deadlock is there when using connected mode,
>>> and there is nothing preventing users from using it this way, therefore I
>>> believe it should be fixed one way or another.
>> the patch is titled with "mlx4:" -- do you expect the problem to come
>> into play only when ipoib connected mode runs over the mlx4 driver?
>> what's about mlx5 or other upstream IB drivers?
> Honestly, I have no idea. I am pretty sure that Mellanox folks have much
> better understanding of the mlx* driver internals than I do. I tried to
> figure out where mlx5 is standing in this respect, but I don't even see
> where ipoib_cm_tx->tx_ring is being allocated there.

ipoib is coded over the verbs API (include/rdma/ib_verbs.h), so 
tracking the path from ipoib through the verbs API into mlx4 should be 
a similar exercise to doing so for mlx5, but let's first treat the higher 
level elements involved with this patch.

Can you shed some light on why the problem happens only for NFS, and not, 
for example, with other IP/TCP storage protocols?

For example, do you expect it to happen with iSCSI/TCP too? The Linux 
iSCSI initiator first opens a TCP socket from user space to the target, 
next it does a login exchange over this socket, and later it provides the 
socket to the kernel iSCSI code to use as the back-end of a SCSI block 
device registered with the SCSI midlayer.


>
>> I'll be looking on the details of the problem/solution,
> Awesome, thanks a lot, that's highly appreciated.
>
>> Do we have a way to tell a net-device instance they should do their
>> memory allocations in a NOFS manner? if not, shouldn't we come up with
>> more general injection method?
> I don't think we have, and it indeed should be rather easy to add. The
> more challenging part of the problem is where (and based on which data)
> the flag would actually be set up on the netdevice so that it's not
> horrible layering violation.
>

I assume that in the same manner netdevices advertise features to the 
networking core, the core can provide them with operating directives 
after they register themselves.
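
A rough sketch of that idea in stand-alone C, assuming a hypothetical per-device "directives" word that the core writes after registration and the driver reads before allocating. The sketch_netdev structure, the SKETCH_NETDEV_DIR_ALLOC_NOFS bit and the helper functions are all made up for illustration and do not correspond to the real struct net_device. Who decides to set the directive (the layering question raised earlier in the thread) is deliberately left open.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical directive bit the core could hand to a registered device. */
#define SKETCH_NETDEV_DIR_ALLOC_NOFS (1u << 0)

/* Illustrative stand-in for a net device; not the kernel's struct net_device. */
struct sketch_netdev {
	unsigned int features;   /* advertised by the driver at registration */
	unsigned int directives; /* written by the core afterwards */
	bool registered;
};

/* Driver side: register the device and advertise its features. */
static inline void sketch_register_netdev(struct sketch_netdev *dev,
					  unsigned int features)
{
	dev->features = features;
	dev->directives = 0;
	dev->registered = true;
}

/* Core side: returns 0 on success, -1 if the device is not registered. */
static inline int sketch_set_directive(struct sketch_netdev *dev,
				       unsigned int directive)
{
	if (dev == NULL || !dev->registered)
		return -1;
	dev->directives |= directive;
	return 0;
}

/* Driver side: consult the directive before allocating, e.g. a TX ring. */
static inline bool sketch_must_alloc_nofs(const struct sketch_netdev *dev)
{
	return (dev->directives & SKETCH_NETDEV_DIR_ALLOC_NOFS) != 0;
}
```

The driver would check sketch_must_alloc_nofs() at each allocation site on the TX path and pick its allocation mask accordingly, so the constraint travels with the device instead of with a module parameter.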




* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
  2014-02-26 21:18 Or Gerlitz
@ 2014-02-27  9:48 ` Jiri Kosina
  2014-02-27  9:58   ` Or Gerlitz
  0 siblings, 1 reply; 22+ messages in thread
From: Jiri Kosina @ 2014-02-27  9:48 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Or Gerlitz,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel

On Wed, 26 Feb 2014, Or Gerlitz wrote:

> > But let's make sure that we don't diverge from the original problem too
> > much. Simple fact is that the deadlock is there when using connected mode,
> > and there is nothing preventing users from using it this way, therefore I
> > believe it should be fixed one way or another.
> 
> the patch is titled with "mlx4:" -- do you expect the problem to come
> into play only when ipoib connected mode runs over the mlx4 driver?
> what's about mlx5 or other upstream IB drivers?

Honestly, I have no idea. I am pretty sure that Mellanox folks have much 
better understanding of the mlx* driver internals than I do. I tried to 
figure out where mlx5 is standing in this respect, but I don't even see 
where ipoib_cm_tx->tx_ring is being allocated there.

> I'll be looking on the details of the problem/solution,

Awesome, thanks a lot, that's highly appreciated.

> Do we have a way to tell a net-device instance they should do their
> memory allocations in a NOFS manner? if not, shouldn't we come up with
> more general injection method?

I don't think we have, and it indeed should be rather easy to add. The 
more challenging part of the problem is where (and based on which data) 
the flag would actually be set up on the netdevice so that it's not 
a horrible layering violation.

-- 
Jiri Kosina
SUSE Labs


* Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP
@ 2014-02-26 21:18 Or Gerlitz
  2014-02-27  9:48 ` Jiri Kosina
  0 siblings, 1 reply; 22+ messages in thread
From: Or Gerlitz @ 2014-02-26 21:18 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Roland Dreier, Amir Vadai, Eli Cohen, Or Gerlitz,
	Eugenia Emantayev, David S. Miller, Mel Gorman, netdev,
	linux-kernel

On Wed, Feb 26, 2014 at 12:55 AM, Jiri Kosina <jkosina@suse.cz> wrote:
> On Wed, 26 Feb 2014, Or Gerlitz wrote:
[...]
> That definitely can be verified, and I am putting it on my TODO list.

OK, thanks


> But let's make sure that we don't diverge from the original problem too
> much. Simple fact is that the deadlock is there when using connected mode,
> and there is nothing preventing users from using it this way, therefore I
> believe it should be fixed one way or another.

the patch is titled with "mlx4:" -- do you expect the problem to come
into play only when ipoib connected mode runs over the mlx4 driver?
What about mlx5 or other upstream IB drivers?

I'll be looking into the details of the problem/solution, but one way
or another the API being a module param sounds more like a hack....

Do we have a way to tell a net-device instance it should do its
memory allocations in a NOFS manner? If not, shouldn't we come up with
a more general injection method?


Thread overview: 22+ messages
2014-02-21 21:53 [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP Jiri Kosina
     [not found] ` <CAJZOPZK4Ah+nKPWnX3=yM43jbf586GYJ+fh0-OL4bOnqKK8v8A@mail.gmail.com>
2014-02-25 21:52   ` Or Gerlitz
2014-02-25 22:11   ` Jiri Kosina
2014-02-25 22:20     ` Or Gerlitz
2014-02-25 22:40       ` Jiri Kosina
2014-02-25 22:48         ` Or Gerlitz
2014-02-25 22:55           ` Jiri Kosina
2014-03-05 19:46     ` Or Gerlitz
2014-03-06 13:31 ` Or Gerlitz
2014-03-06 13:47   ` Jiri Kosina
2014-02-26 21:18 Or Gerlitz
2014-02-27  9:48 ` Jiri Kosina
2014-02-27  9:58   ` Or Gerlitz
2014-02-27 10:42     ` Jiri Kosina
2014-03-04 22:48       ` Jiri Kosina
2014-03-05 15:57         ` Or Gerlitz
2014-03-05 19:25       ` Roland Dreier
2014-03-11 13:53         ` Or Gerlitz
2014-03-14 19:50           ` Jiri Kosina
2014-04-24 17:03           ` Jiri Kosina
2014-04-24 20:01             ` Or Gerlitz
2014-05-02 13:03               ` Jiri Kosina
