* [PATCH for-next 0/4] Add Features & Code improvements for hip08
@ 2017-09-30 9:28 ` Wei Hu (Xavier)
0 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-09-30 9:28 UTC (permalink / raw)
To: dledford
Cc: linux-rdma, xavier.huwei, lijun_nudt, oulijun, charles.chenxin,
liuyixian, xushaobo2, zhangxiping3, xavier.huwei, linuxarm,
linux-kernel, shaobohsu, shaoboxu
This patch set introduces PBL page size configuration support and IOMMU
support, and updates the PD, CQE and MTT specifications and the IRRL
table chunk size for hip08.
Shaobo Xu (1):
RDMA/hns: Support WQE/CQE/PBL page size configurable feature in hip08
Wei Hu (Xavier) (3):
RDMA/hns: Add IOMMU enable support in hip08
RDMA/hns: Update the IRRL table chunk size in hip08
RDMA/hns: Update the PD&CQE&MTT specification in hip08
drivers/infiniband/hw/hns/hns_roce_alloc.c | 34 +++++++----
drivers/infiniband/hw/hns/hns_roce_cq.c | 21 ++++++-
drivers/infiniband/hw/hns/hns_roce_device.h | 13 ++--
drivers/infiniband/hw/hns/hns_roce_hem.c | 61 +++++++++++++------
drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++
drivers/infiniband/hw/hns/hns_roce_hw_v1.c | 1 +
drivers/infiniband/hw/hns/hns_roce_hw_v1.h | 2 +
drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 23 ++++---
drivers/infiniband/hw/hns/hns_roce_hw_v2.h | 10 ++--
drivers/infiniband/hw/hns/hns_roce_mr.c | 93 ++++++++++++++++++++---------
drivers/infiniband/hw/hns/hns_roce_qp.c | 46 ++++++++++----
11 files changed, 222 insertions(+), 88 deletions(-)
--
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH for-next 1/4] RDMA/hns: Support WQE/CQE/PBL page size configurable feature in hip08
2017-09-30 9:28 ` Wei Hu (Xavier)
@ 2017-09-30 9:28 ` Wei Hu (Xavier)
1 sibling, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-09-30 9:28 UTC (permalink / raw)
To: dledford
Cc: linux-rdma, xavier.huwei, lijun_nudt, oulijun, charles.chenxin,
liuyixian, xushaobo2, zhangxiping3, xavier.huwei, linuxarm,
linux-kernel, shaobohsu, shaoboxu
From: Shaobo Xu <xushaobo2@huawei.com>
This patch adds support for the configurable WQE, CQE and PBL page size
feature, which covers both the base address page size and the buffer
page size.
Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
---
drivers/infiniband/hw/hns/hns_roce_alloc.c | 29 +++++----
drivers/infiniband/hw/hns/hns_roce_cq.c | 21 ++++++-
drivers/infiniband/hw/hns/hns_roce_device.h | 10 ++--
drivers/infiniband/hw/hns/hns_roce_mr.c | 93 ++++++++++++++++++++---------
drivers/infiniband/hw/hns/hns_roce_qp.c | 46 ++++++++++----
5 files changed, 142 insertions(+), 57 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c b/drivers/infiniband/hw/hns/hns_roce_alloc.c
index 8c9a33f..3e4c525 100644
--- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
+++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
@@ -167,12 +167,12 @@ void hns_roce_buf_free(struct hns_roce_dev *hr_dev, u32 size,
if (buf->nbufs == 1) {
dma_free_coherent(dev, size, buf->direct.buf, buf->direct.map);
} else {
- if (bits_per_long == 64)
+ if (bits_per_long == 64 && buf->page_shift == PAGE_SHIFT)
vunmap(buf->direct.buf);
for (i = 0; i < buf->nbufs; ++i)
if (buf->page_list[i].buf)
- dma_free_coherent(dev, PAGE_SIZE,
+ dma_free_coherent(dev, 1 << buf->page_shift,
buf->page_list[i].buf,
buf->page_list[i].map);
kfree(buf->page_list);
@@ -181,20 +181,27 @@ void hns_roce_buf_free(struct hns_roce_dev *hr_dev, u32 size,
EXPORT_SYMBOL_GPL(hns_roce_buf_free);
int hns_roce_buf_alloc(struct hns_roce_dev *hr_dev, u32 size, u32 max_direct,
- struct hns_roce_buf *buf)
+ struct hns_roce_buf *buf, u32 page_shift)
{
int i = 0;
dma_addr_t t;
struct page **pages;
struct device *dev = hr_dev->dev;
u32 bits_per_long = BITS_PER_LONG;
+ u32 page_size = 1 << page_shift;
+ u32 order;
/* SQ/RQ buf less than one page, SQ + RQ = 8K */
if (size <= max_direct) {
buf->nbufs = 1;
/* Npages calculated by page_size */
- buf->npages = 1 << get_order(size);
- buf->page_shift = PAGE_SHIFT;
+ order = get_order(size);
+ if (order <= page_shift - PAGE_SHIFT)
+ order = 0;
+ else
+ order -= page_shift - PAGE_SHIFT;
+ buf->npages = 1 << order;
+ buf->page_shift = page_shift;
/* MTT PA must be recorded in 4k alignment, t is 4k aligned */
buf->direct.buf = dma_alloc_coherent(dev, size, &t, GFP_KERNEL);
if (!buf->direct.buf)
@@ -209,9 +216,9 @@ int hns_roce_buf_alloc(struct hns_roce_dev *hr_dev, u32 size, u32 max_direct,
memset(buf->direct.buf, 0, size);
} else {
- buf->nbufs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
+ buf->nbufs = (size + page_size - 1) / page_size;
buf->npages = buf->nbufs;
- buf->page_shift = PAGE_SHIFT;
+ buf->page_shift = page_shift;
buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
GFP_KERNEL);
@@ -220,16 +227,16 @@ int hns_roce_buf_alloc(struct hns_roce_dev *hr_dev, u32 size, u32 max_direct,
for (i = 0; i < buf->nbufs; ++i) {
buf->page_list[i].buf = dma_alloc_coherent(dev,
- PAGE_SIZE, &t,
+ page_size, &t,
GFP_KERNEL);
if (!buf->page_list[i].buf)
goto err_free;
buf->page_list[i].map = t;
- memset(buf->page_list[i].buf, 0, PAGE_SIZE);
+ memset(buf->page_list[i].buf, 0, page_size);
}
- if (bits_per_long == 64) {
+ if (bits_per_long == 64 && page_shift == PAGE_SHIFT) {
pages = kmalloc_array(buf->nbufs, sizeof(*pages),
GFP_KERNEL);
if (!pages)
@@ -243,6 +250,8 @@ int hns_roce_buf_alloc(struct hns_roce_dev *hr_dev, u32 size, u32 max_direct,
kfree(pages);
if (!buf->direct.buf)
goto err_free;
+ } else {
+ buf->direct.buf = NULL;
}
}
diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c
index 88cdf6f..f558f95 100644
--- a/drivers/infiniband/hw/hns/hns_roce_cq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_cq.c
@@ -220,6 +220,8 @@ static int hns_roce_ib_get_cq_umem(struct hns_roce_dev *hr_dev,
struct ib_umem **umem, u64 buf_addr, int cqe)
{
int ret;
+ u32 page_shift;
+ u32 npages;
*umem = ib_umem_get(context, buf_addr, cqe * hr_dev->caps.cq_entry_sz,
IB_ACCESS_LOCAL_WRITE, 1);
@@ -230,8 +232,19 @@ static int hns_roce_ib_get_cq_umem(struct hns_roce_dev *hr_dev,
buf->hr_mtt.mtt_type = MTT_TYPE_CQE;
else
buf->hr_mtt.mtt_type = MTT_TYPE_WQE;
- ret = hns_roce_mtt_init(hr_dev, ib_umem_page_count(*umem),
- (*umem)->page_shift, &buf->hr_mtt);
+
+ if (hr_dev->caps.cqe_buf_pg_sz) {
+ npages = (ib_umem_page_count(*umem) +
+ (1 << hr_dev->caps.cqe_buf_pg_sz) - 1) /
+ (1 << hr_dev->caps.cqe_buf_pg_sz);
+ page_shift = PAGE_SHIFT + hr_dev->caps.cqe_buf_pg_sz;
+ ret = hns_roce_mtt_init(hr_dev, npages, page_shift,
+ &buf->hr_mtt);
+ } else {
+ ret = hns_roce_mtt_init(hr_dev, ib_umem_page_count(*umem),
+ (*umem)->page_shift,
+ &buf->hr_mtt);
+ }
if (ret)
goto err_buf;
@@ -253,9 +266,11 @@ static int hns_roce_ib_alloc_cq_buf(struct hns_roce_dev *hr_dev,
struct hns_roce_cq_buf *buf, u32 nent)
{
int ret;
+ u32 page_shift = PAGE_SHIFT + hr_dev->caps.cqe_buf_pg_sz;
ret = hns_roce_buf_alloc(hr_dev, nent * hr_dev->caps.cq_entry_sz,
- PAGE_SIZE * 2, &buf->hr_buf);
+ (1 << page_shift) * 2, &buf->hr_buf,
+ page_shift);
if (ret)
goto out;
diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
index b314ac0..9353400 100644
--- a/drivers/infiniband/hw/hns/hns_roce_device.h
+++ b/drivers/infiniband/hw/hns/hns_roce_device.h
@@ -711,12 +711,14 @@ static inline void hns_roce_write64_k(__be32 val[2], void __iomem *dest)
static inline void *hns_roce_buf_offset(struct hns_roce_buf *buf, int offset)
{
u32 bits_per_long_val = BITS_PER_LONG;
+ u32 page_size = 1 << buf->page_shift;
- if (bits_per_long_val == 64 || buf->nbufs == 1)
+ if ((bits_per_long_val == 64 && buf->page_shift == PAGE_SHIFT) ||
+ buf->nbufs == 1)
return (char *)(buf->direct.buf) + offset;
else
- return (char *)(buf->page_list[offset >> PAGE_SHIFT].buf) +
- (offset & (PAGE_SIZE - 1));
+ return (char *)(buf->page_list[offset >> buf->page_shift].buf) +
+ (offset & (page_size - 1));
}
int hns_roce_init_uar_table(struct hns_roce_dev *dev);
@@ -787,7 +789,7 @@ int hns_roce_hw2sw_mpt(struct hns_roce_dev *hr_dev,
void hns_roce_buf_free(struct hns_roce_dev *hr_dev, u32 size,
struct hns_roce_buf *buf);
int hns_roce_buf_alloc(struct hns_roce_dev *hr_dev, u32 size, u32 max_direct,
- struct hns_roce_buf *buf);
+ struct hns_roce_buf *buf, u32 page_shift);
int hns_roce_ib_umem_write_mtt(struct hns_roce_dev *hr_dev,
struct hns_roce_mtt *mtt, struct ib_umem *umem);
diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c
index 452136d..c13b415 100644
--- a/drivers/infiniband/hw/hns/hns_roce_mr.c
+++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
@@ -708,11 +708,17 @@ static int hns_roce_write_mtt_chunk(struct hns_roce_dev *hr_dev,
dma_addr_t dma_handle;
__le64 *mtts;
u32 s = start_index * sizeof(u64);
+ u32 bt_page_size;
u32 i;
+ if (mtt->mtt_type == MTT_TYPE_WQE)
+ bt_page_size = 1 << (hr_dev->caps.mtt_ba_pg_sz + PAGE_SHIFT);
+ else
+ bt_page_size = 1 << (hr_dev->caps.cqe_ba_pg_sz + PAGE_SHIFT);
+
/* All MTTs must fit in the same page */
- if (start_index / (PAGE_SIZE / sizeof(u64)) !=
- (start_index + npages - 1) / (PAGE_SIZE / sizeof(u64)))
+ if (start_index / (bt_page_size / sizeof(u64)) !=
+ (start_index + npages - 1) / (bt_page_size / sizeof(u64)))
return -EINVAL;
if (start_index & (HNS_ROCE_MTT_ENTRY_PER_SEG - 1))
@@ -746,12 +752,18 @@ static int hns_roce_write_mtt(struct hns_roce_dev *hr_dev,
{
int chunk;
int ret;
+ u32 bt_page_size;
if (mtt->order < 0)
return -EINVAL;
+ if (mtt->mtt_type == MTT_TYPE_WQE)
+ bt_page_size = 1 << (hr_dev->caps.mtt_ba_pg_sz + PAGE_SHIFT);
+ else
+ bt_page_size = 1 << (hr_dev->caps.cqe_ba_pg_sz + PAGE_SHIFT);
+
while (npages > 0) {
- chunk = min_t(int, PAGE_SIZE / sizeof(u64), npages);
+ chunk = min_t(int, bt_page_size / sizeof(u64), npages);
ret = hns_roce_write_mtt_chunk(hr_dev, mtt, start_index, chunk,
page_list);
@@ -869,25 +881,44 @@ struct ib_mr *hns_roce_get_dma_mr(struct ib_pd *pd, int acc)
int hns_roce_ib_umem_write_mtt(struct hns_roce_dev *hr_dev,
struct hns_roce_mtt *mtt, struct ib_umem *umem)
{
+ struct device *dev = hr_dev->dev;
struct scatterlist *sg;
+ unsigned int order;
int i, k, entry;
+ int npage = 0;
int ret = 0;
+ int len;
+ u64 page_addr;
u64 *pages;
+ u32 bt_page_size;
u32 n;
- int len;
- pages = (u64 *) __get_free_page(GFP_KERNEL);
+ order = mtt->mtt_type == MTT_TYPE_WQE ? hr_dev->caps.mtt_ba_pg_sz :
+ hr_dev->caps.cqe_ba_pg_sz;
+ bt_page_size = 1 << (order + PAGE_SHIFT);
+
+ pages = (u64 *) __get_free_pages(GFP_KERNEL, order);
if (!pages)
return -ENOMEM;
i = n = 0;
for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
- len = sg_dma_len(sg) >> mtt->page_shift;
+ len = sg_dma_len(sg) >> PAGE_SHIFT;
for (k = 0; k < len; ++k) {
- pages[i++] = sg_dma_address(sg) +
- (k << umem->page_shift);
- if (i == PAGE_SIZE / sizeof(u64)) {
+ page_addr =
+ sg_dma_address(sg) + (k << umem->page_shift);
+ if (!(npage % (1 << (mtt->page_shift - PAGE_SHIFT)))) {
+ if (page_addr & ((1 << mtt->page_shift) - 1)) {
+ dev_err(dev, "page_addr 0x%llx is not page_shift %d alignment!\n",
+ page_addr, mtt->page_shift);
+ ret = -EINVAL;
+ goto out;
+ }
+ pages[i++] = page_addr;
+ }
+ npage++;
+ if (i == bt_page_size / sizeof(u64)) {
ret = hns_roce_write_mtt(hr_dev, mtt, n, i,
pages);
if (ret)
@@ -911,29 +942,37 @@ static int hns_roce_ib_umem_write_mr(struct hns_roce_dev *hr_dev,
struct ib_umem *umem)
{
struct scatterlist *sg;
- int i = 0, j = 0;
+ int i = 0, j = 0, k = 0;
int entry;
+ int len;
+ u64 page_addr;
+ u32 pbl_bt_sz;
if (hr_dev->caps.pbl_hop_num == HNS_ROCE_HOP_NUM_0)
return 0;
+ pbl_bt_sz = 1 << (hr_dev->caps.pbl_ba_pg_sz + PAGE_SHIFT);
for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
- if (!hr_dev->caps.pbl_hop_num) {
- mr->pbl_buf[i] = ((u64)sg_dma_address(sg)) >> 12;
- i++;
- } else if (hr_dev->caps.pbl_hop_num == 1) {
- mr->pbl_buf[i] = sg_dma_address(sg);
- i++;
- } else {
- if (hr_dev->caps.pbl_hop_num == 2)
- mr->pbl_bt_l1[i][j] = sg_dma_address(sg);
- else if (hr_dev->caps.pbl_hop_num == 3)
- mr->pbl_bt_l2[i][j] = sg_dma_address(sg);
-
- j++;
- if (j >= (PAGE_SIZE / 8)) {
- i++;
- j = 0;
+ len = sg_dma_len(sg) >> PAGE_SHIFT;
+ for (k = 0; k < len; ++k) {
+ page_addr = sg_dma_address(sg) +
+ (k << umem->page_shift);
+
+ if (!hr_dev->caps.pbl_hop_num) {
+ mr->pbl_buf[i++] = page_addr >> 12;
+ } else if (hr_dev->caps.pbl_hop_num == 1) {
+ mr->pbl_buf[i++] = page_addr;
+ } else {
+ if (hr_dev->caps.pbl_hop_num == 2)
+ mr->pbl_bt_l1[i][j] = page_addr;
+ else if (hr_dev->caps.pbl_hop_num == 3)
+ mr->pbl_bt_l2[i][j] = page_addr;
+
+ j++;
+ if (j >= (pbl_bt_sz / 8)) {
+ i++;
+ j = 0;
+ }
}
}
}
@@ -986,7 +1025,7 @@ struct ib_mr *hns_roce_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
} else {
int pbl_size = 1;
- bt_size = (1 << PAGE_SHIFT) / 8;
+ bt_size = (1 << (hr_dev->caps.pbl_ba_pg_sz + PAGE_SHIFT)) / 8;
for (i = 0; i < hr_dev->caps.pbl_hop_num; i++)
pbl_size *= bt_size;
if (n > pbl_size) {
diff --git a/drivers/infiniband/hw/hns/hns_roce_qp.c b/drivers/infiniband/hw/hns/hns_roce_qp.c
index e6d1115..b1c9a37 100644
--- a/drivers/infiniband/hw/hns/hns_roce_qp.c
+++ b/drivers/infiniband/hw/hns/hns_roce_qp.c
@@ -322,6 +322,7 @@ static int hns_roce_set_user_sq_size(struct hns_roce_dev *hr_dev,
{
u32 roundup_sq_stride = roundup_pow_of_two(hr_dev->caps.max_sq_desc_sz);
u8 max_sq_stride = ilog2(roundup_sq_stride);
+ u32 page_size;
u32 max_cnt;
/* Sanity check SQ size before proceeding */
@@ -363,28 +364,29 @@ static int hns_roce_set_user_sq_size(struct hns_roce_dev *hr_dev,
hr_qp->rq.offset = HNS_ROCE_ALOGN_UP((hr_qp->sq.wqe_cnt <<
hr_qp->sq.wqe_shift), PAGE_SIZE);
} else {
+ page_size = 1 << (hr_dev->caps.mtt_buf_pg_sz + PAGE_SHIFT);
hr_qp->buff_size = HNS_ROCE_ALOGN_UP((hr_qp->rq.wqe_cnt <<
- hr_qp->rq.wqe_shift), PAGE_SIZE) +
+ hr_qp->rq.wqe_shift), page_size) +
HNS_ROCE_ALOGN_UP((hr_qp->sge.sge_cnt <<
- hr_qp->sge.sge_shift), PAGE_SIZE) +
+ hr_qp->sge.sge_shift), page_size) +
HNS_ROCE_ALOGN_UP((hr_qp->sq.wqe_cnt <<
- hr_qp->sq.wqe_shift), PAGE_SIZE);
+ hr_qp->sq.wqe_shift), page_size);
hr_qp->sq.offset = 0;
if (hr_qp->sge.sge_cnt) {
hr_qp->sge.offset = HNS_ROCE_ALOGN_UP(
(hr_qp->sq.wqe_cnt <<
hr_qp->sq.wqe_shift),
- PAGE_SIZE);
+ page_size);
hr_qp->rq.offset = hr_qp->sge.offset +
HNS_ROCE_ALOGN_UP((hr_qp->sge.sge_cnt <<
hr_qp->sge.sge_shift),
- PAGE_SIZE);
+ page_size);
} else {
hr_qp->rq.offset = HNS_ROCE_ALOGN_UP(
(hr_qp->sq.wqe_cnt <<
hr_qp->sq.wqe_shift),
- PAGE_SIZE);
+ page_size);
}
}
@@ -396,6 +398,7 @@ static int hns_roce_set_kernel_sq_size(struct hns_roce_dev *hr_dev,
struct hns_roce_qp *hr_qp)
{
struct device *dev = hr_dev->dev;
+ u32 page_size;
u32 max_cnt;
int size;
@@ -435,19 +438,20 @@ static int hns_roce_set_kernel_sq_size(struct hns_roce_dev *hr_dev,
}
/* Get buf size, SQ and RQ are aligned to PAGE_SIZE */
+ page_size = 1 << (hr_dev->caps.mtt_buf_pg_sz + PAGE_SHIFT);
hr_qp->sq.offset = 0;
size = HNS_ROCE_ALOGN_UP(hr_qp->sq.wqe_cnt << hr_qp->sq.wqe_shift,
- PAGE_SIZE);
+ page_size);
if (hr_dev->caps.max_sq_sg > 2 && hr_qp->sge.sge_cnt) {
hr_qp->sge.offset = size;
size += HNS_ROCE_ALOGN_UP(hr_qp->sge.sge_cnt <<
- hr_qp->sge.sge_shift, PAGE_SIZE);
+ hr_qp->sge.sge_shift, page_size);
}
hr_qp->rq.offset = size;
size += HNS_ROCE_ALOGN_UP((hr_qp->rq.wqe_cnt << hr_qp->rq.wqe_shift),
- PAGE_SIZE);
+ page_size);
hr_qp->buff_size = size;
/* Get wr and sge number which send */
@@ -470,6 +474,8 @@ static int hns_roce_create_qp_common(struct hns_roce_dev *hr_dev,
struct hns_roce_ib_create_qp ucmd;
unsigned long qpn = 0;
int ret = 0;
+ u32 page_shift;
+ u32 npages;
mutex_init(&hr_qp->mutex);
spin_lock_init(&hr_qp->sq.lock);
@@ -513,8 +519,20 @@ static int hns_roce_create_qp_common(struct hns_roce_dev *hr_dev,
}
hr_qp->mtt.mtt_type = MTT_TYPE_WQE;
- ret = hns_roce_mtt_init(hr_dev, ib_umem_page_count(hr_qp->umem),
- hr_qp->umem->page_shift, &hr_qp->mtt);
+ if (hr_dev->caps.mtt_buf_pg_sz) {
+ npages = (ib_umem_page_count(hr_qp->umem) +
+ (1 << hr_dev->caps.mtt_buf_pg_sz) - 1) /
+ (1 << hr_dev->caps.mtt_buf_pg_sz);
+ page_shift = PAGE_SHIFT + hr_dev->caps.mtt_buf_pg_sz;
+ ret = hns_roce_mtt_init(hr_dev, npages,
+ page_shift,
+ &hr_qp->mtt);
+ } else {
+ ret = hns_roce_mtt_init(hr_dev,
+ ib_umem_page_count(hr_qp->umem),
+ hr_qp->umem->page_shift,
+ &hr_qp->mtt);
+ }
if (ret) {
dev_err(dev, "hns_roce_mtt_init error for create qp\n");
goto err_buf;
@@ -555,8 +573,10 @@ static int hns_roce_create_qp_common(struct hns_roce_dev *hr_dev,
DB_REG_OFFSET * hr_dev->priv_uar.index;
/* Allocate QP buf */
- if (hns_roce_buf_alloc(hr_dev, hr_qp->buff_size, PAGE_SIZE * 2,
- &hr_qp->hr_buf)) {
+ page_shift = PAGE_SHIFT + hr_dev->caps.mtt_buf_pg_sz;
+ if (hns_roce_buf_alloc(hr_dev, hr_qp->buff_size,
+ (1 << page_shift) * 2,
+ &hr_qp->hr_buf, page_shift)) {
dev_err(dev, "hns_roce_buf_alloc error!\n");
ret = -ENOMEM;
goto err_out;
--
1.9.1
^ permalink raw reply related [flat|nested] 57+ messages in thread
+ page_shift,
+ &hr_qp->mtt);
+ } else {
+ ret = hns_roce_mtt_init(hr_dev,
+ ib_umem_page_count(hr_qp->umem),
+ hr_qp->umem->page_shift,
+ &hr_qp->mtt);
+ }
if (ret) {
dev_err(dev, "hns_roce_mtt_init error for create qp\n");
goto err_buf;
@@ -555,8 +573,10 @@ static int hns_roce_create_qp_common(struct hns_roce_dev *hr_dev,
DB_REG_OFFSET * hr_dev->priv_uar.index;
/* Allocate QP buf */
- if (hns_roce_buf_alloc(hr_dev, hr_qp->buff_size, PAGE_SIZE * 2,
- &hr_qp->hr_buf)) {
+ page_shift = PAGE_SHIFT + hr_dev->caps.mtt_buf_pg_sz;
+ if (hns_roce_buf_alloc(hr_dev, hr_qp->buff_size,
+ (1 << page_shift) * 2,
+ &hr_qp->hr_buf, page_shift)) {
dev_err(dev, "hns_roce_buf_alloc error!\n");
ret = -ENOMEM;
goto err_out;
--
1.9.1
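The mtt_buf_pg_sz branch added to hns_roce_create_qp_common above aggregates the umem page count into units of 2^mtt_buf_pg_sz pages and widens the page shift to match. The rounding can be checked in isolation (standalone helper; the name is illustrative, not a driver symbol):

```c
#include <assert.h>

/* Round a base-page count up to units of 2^mtt_buf_pg_sz pages, as the
 * hr_dev->caps.mtt_buf_pg_sz branch does before hns_roce_mtt_init(). */
static unsigned int mtt_npages(unsigned int umem_page_count,
			       unsigned int mtt_buf_pg_sz)
{
	unsigned int unit = 1u << mtt_buf_pg_sz;

	return (umem_page_count + unit - 1) / unit;
}
```

With mtt_buf_pg_sz == 0 the unit is one page and the count is unchanged, which is why the else branch can keep passing ib_umem_page_count() and umem->page_shift straight through.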
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-09-30 9:28 ` Wei Hu (Xavier)
@ 2017-09-30 9:28 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-09-30 9:28 UTC (permalink / raw)
To: dledford
Cc: linux-rdma, xavier.huwei, lijun_nudt, oulijun, charles.chenxin,
liuyixian, xushaobo2, zhangxiping3, xavier.huwei, linuxarm,
linux-kernel, shaobohsu, shaoboxu
If the IOMMU is enabled, the length of an sg entry obtained from
__iommu_map_sg_attrs is not 4 kB, so an IOVA built from the sg dma
address alone is not page-contiguous. In addition, the VA returned
from dma_alloc_coherent can be a vmalloc address, in which case the
VA obtained via page_address does not match the buffer. Under these
circumstances, the IOVA should be calculated based on the sg length,
and the VA returned from dma_alloc_coherent should be recorded in
the hem struct.
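The fix records, per chunk page, whether the buffer came back as a vmalloc address, so that hns_roce_free_hem() can hand the original VA (rather than lowmem_page_address() of the sg page) back to dma_free_coherent(). A userspace analogue of that bookkeeping (struct and helper names here are illustrative, mirroring but not identical to the kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-page record, modeled on struct hns_roce_vmalloc from the patch. */
struct vmalloc_rec {
	bool is_vmalloc_addr;
	void *vmalloc_addr;
};

/* Select the CPU address handed to the free routine: the saved vmalloc
 * VA when flagged, otherwise the lowmem-derived VA, as in the
 * hns_roce_free_hem() hunk. */
static void *cpu_addr_for_free(const struct vmalloc_rec *rec, void *lowmem_va)
{
	return rec->is_vmalloc_addr ? rec->vmalloc_addr : lowmem_va;
}
```

The same flag drives hns_roce_table_find(), which returns the recorded vmalloc VA plus the offset instead of falling through to the page-based lookup.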
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
---
drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
drivers/infiniband/hw/hns/hns_roce_hem.c | 30 +++++++++++++++++++++++++++---
drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
4 files changed, 52 insertions(+), 11 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c b/drivers/infiniband/hw/hns/hns_roce_alloc.c
index 3e4c525..a69cd4b 100644
--- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
+++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
@@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev *hr_dev, u32 size, u32 max_direct,
goto err_free;
for (i = 0; i < buf->nbufs; ++i)
- pages[i] = virt_to_page(buf->page_list[i].buf);
+ pages[i] =
+ is_vmalloc_addr(buf->page_list[i].buf) ?
+ vmalloc_to_page(buf->page_list[i].buf) :
+ virt_to_page(buf->page_list[i].buf);
buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
PAGE_KERNEL);
diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c b/drivers/infiniband/hw/hns/hns_roce_hem.c
index 8388ae2..4a3d1d4 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hem.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
@@ -200,6 +200,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
gfp_t gfp_mask)
{
struct hns_roce_hem_chunk *chunk = NULL;
+ struct hns_roce_vmalloc *vmalloc;
struct hns_roce_hem *hem;
struct scatterlist *mem;
int order;
@@ -227,6 +228,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
chunk->npages = 0;
chunk->nsg = 0;
+ memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
list_add_tail(&chunk->list, &hem->chunk_list);
}
@@ -243,7 +245,15 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
if (!buf)
goto fail;
- sg_set_buf(mem, buf, PAGE_SIZE << order);
+ if (is_vmalloc_addr(buf)) {
+ vmalloc = &chunk->vmalloc[chunk->npages];
+ vmalloc->is_vmalloc_addr = true;
+ vmalloc->vmalloc_addr = buf;
+ sg_set_page(mem, vmalloc_to_page(buf),
+ PAGE_SIZE << order, offset_in_page(buf));
+ } else {
+ sg_set_buf(mem, buf, PAGE_SIZE << order);
+ }
WARN_ON(mem->offset);
sg_dma_len(mem) = PAGE_SIZE << order;
@@ -262,17 +272,25 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct hns_roce_hem *hem)
{
struct hns_roce_hem_chunk *chunk, *tmp;
+ void *cpu_addr;
int i;
if (!hem)
return;
list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
- for (i = 0; i < chunk->npages; ++i)
+ for (i = 0; i < chunk->npages; ++i) {
+ if (chunk->vmalloc[i].is_vmalloc_addr)
+ cpu_addr = chunk->vmalloc[i].vmalloc_addr;
+ else
+ cpu_addr =
+ lowmem_page_address(sg_page(&chunk->mem[i]));
+
dma_free_coherent(hr_dev->dev,
chunk->mem[i].length,
- lowmem_page_address(sg_page(&chunk->mem[i])),
+ cpu_addr,
sg_dma_address(&chunk->mem[i]));
+ }
kfree(chunk);
}
@@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev *hr_dev,
if (chunk->mem[i].length > (u32)offset) {
page = sg_page(&chunk->mem[i]);
+ if (chunk->vmalloc[i].is_vmalloc_addr) {
+ mutex_unlock(&table->mutex);
+ return page ?
+ chunk->vmalloc[i].vmalloc_addr
+ + offset : NULL;
+ }
goto out;
}
offset -= chunk->mem[i].length;
diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h b/drivers/infiniband/hw/hns/hns_roce_hem.h
index af28bbf..62d712a 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hem.h
+++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
@@ -72,11 +72,17 @@ enum {
HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
};
+struct hns_roce_vmalloc {
+ bool is_vmalloc_addr;
+ void *vmalloc_addr;
+};
+
struct hns_roce_hem_chunk {
struct list_head list;
int npages;
int nsg;
struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
+ struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
};
struct hns_roce_hem {
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index b99d70a..9e19bf1 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
{
struct hns_roce_v2_mpt_entry *mpt_entry;
struct scatterlist *sg;
+ u64 page_addr = 0;
u64 *pages;
+ int i = 0, j = 0;
+ int len = 0;
int entry;
- int i;
mpt_entry = mb_buf;
memset(mpt_entry, 0, sizeof(*mpt_entry));
@@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
i = 0;
for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
- pages[i] = ((u64)sg_dma_address(sg)) >> 6;
-
- /* Record the first 2 entry directly to MTPT table */
- if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
- break;
- i++;
+ len = sg_dma_len(sg) >> PAGE_SHIFT;
+ for (j = 0; j < len; ++j) {
+ page_addr = sg_dma_address(sg) +
+ (j << mr->umem->page_shift);
+ pages[i] = page_addr >> 6;
+
+ /* Record the first 2 entry directly to MTPT table */
+ if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
+ goto found;
+ i++;
+ }
}
+found:
mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
V2_MPT_BYTE_56_PA0_H_S,
--
1.9.1
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH for-next 3/4] RDMA/hns: Update the IRRL table chunk size in hip08
2017-09-30 9:28 ` Wei Hu (Xavier)
@ 2017-09-30 9:29 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-09-30 9:29 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
xavier.huwei-hv44wF8Li93QT0dZR+AlfA, lijun_nudt-9Onoh4P/yGk,
oulijun-hv44wF8Li93QT0dZR+AlfA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
xushaobo2-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, xavier.huwei-WVlzvzqoTvw,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, shaobohsu-9Onoh4P/yGk,
shaoboxu-WVlzvzqoTvw
With the increase of the IRRL specification in hip08, the IRRL table
chunk size needs to be updated.
This patch updates the IRRL table chunk size to 256k for hip08.
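With the chunk size now coming from hr_dev->caps.chunk_sz instead of a compile-time constant, the HEM index arithmetic used throughout hns_roce_hem.c is unchanged: the table slot is the masked object index divided by objects-per-chunk. In isolation (hypothetical helper; power-of-two num_obj is assumed, as in the driver):

```c
#include <assert.h>

/* HEM slot for an object: (obj & (num_obj - 1)) / (chunk_size / obj_size),
 * as computed in hns_roce_table_get() and friends. */
static unsigned long hem_index(unsigned long obj, unsigned long num_obj,
			       unsigned long chunk_size,
			       unsigned long obj_size)
{
	return (obj & (num_obj - 1)) / (chunk_size / obj_size);
}
```

Doubling the chunk size from 128k (hip06) to 256k (hip08) halves the slot index for a given object, i.e. twice as many objects fit in each HEM allocation.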
Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Signed-off-by: Shaobo Xu <xushaobo2-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
drivers/infiniband/hw/hns/hns_roce_device.h | 3 +++
drivers/infiniband/hw/hns/hns_roce_hem.c | 31 ++++++++++++++---------------
drivers/infiniband/hw/hns/hns_roce_hw_v1.c | 1 +
drivers/infiniband/hw/hns/hns_roce_hw_v1.h | 2 ++
drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 1 +
drivers/infiniband/hw/hns/hns_roce_hw_v2.h | 2 ++
6 files changed, 24 insertions(+), 16 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
index 9353400..fc2a53d 100644
--- a/drivers/infiniband/hw/hns/hns_roce_device.h
+++ b/drivers/infiniband/hw/hns/hns_roce_device.h
@@ -236,6 +236,8 @@ struct hns_roce_hem_table {
unsigned long num_obj;
/*Single obj size */
unsigned long obj_size;
+ unsigned long table_chunk_size;
+ unsigned long hem_alloc_size;
int lowmem;
struct mutex mutex;
struct hns_roce_hem **hem;
@@ -565,6 +567,7 @@ struct hns_roce_caps {
u32 cqe_ba_pg_sz;
u32 cqe_buf_pg_sz;
u32 cqe_hop_num;
+ u32 chunk_sz; /* chunk size in non multihop mode*/
};
struct hns_roce_hw {
diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c b/drivers/infiniband/hw/hns/hns_roce_hem.c
index 4a3d1d4..c08bc16 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hem.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
@@ -36,9 +36,6 @@
#include "hns_roce_hem.h"
#include "hns_roce_common.h"
-#define HNS_ROCE_HEM_ALLOC_SIZE (1 << 17)
-#define HNS_ROCE_TABLE_CHUNK_SIZE (1 << 17)
-
#define DMA_ADDR_T_SHIFT 12
#define BT_BA_SHIFT 32
@@ -314,7 +311,7 @@ static int hns_roce_set_hem(struct hns_roce_dev *hr_dev,
/* Find the HEM(Hardware Entry Memory) entry */
unsigned long i = (obj & (table->num_obj - 1)) /
- (HNS_ROCE_TABLE_CHUNK_SIZE / table->obj_size);
+ (table->table_chunk_size / table->obj_size);
switch (table->type) {
case HEM_TYPE_QPC:
@@ -559,7 +556,7 @@ int hns_roce_table_get(struct hns_roce_dev *hr_dev,
if (hns_roce_check_whether_mhop(hr_dev, table->type))
return hns_roce_table_mhop_get(hr_dev, table, obj);
- i = (obj & (table->num_obj - 1)) / (HNS_ROCE_TABLE_CHUNK_SIZE /
+ i = (obj & (table->num_obj - 1)) / (table->table_chunk_size /
table->obj_size);
mutex_lock(&table->mutex);
@@ -570,8 +567,8 @@ int hns_roce_table_get(struct hns_roce_dev *hr_dev,
}
table->hem[i] = hns_roce_alloc_hem(hr_dev,
- HNS_ROCE_TABLE_CHUNK_SIZE >> PAGE_SHIFT,
- HNS_ROCE_HEM_ALLOC_SIZE,
+ table->table_chunk_size >> PAGE_SHIFT,
+ table->hem_alloc_size,
(table->lowmem ? GFP_KERNEL :
GFP_HIGHUSER) | __GFP_NOWARN);
if (!table->hem[i]) {
@@ -720,7 +717,7 @@ void hns_roce_table_put(struct hns_roce_dev *hr_dev,
}
i = (obj & (table->num_obj - 1)) /
- (HNS_ROCE_TABLE_CHUNK_SIZE / table->obj_size);
+ (table->table_chunk_size / table->obj_size);
mutex_lock(&table->mutex);
@@ -757,8 +754,8 @@ void *hns_roce_table_find(struct hns_roce_dev *hr_dev,
if (!hns_roce_check_whether_mhop(hr_dev, table->type)) {
idx = (obj & (table->num_obj - 1)) * table->obj_size;
- hem = table->hem[idx / HNS_ROCE_TABLE_CHUNK_SIZE];
- dma_offset = offset = idx % HNS_ROCE_TABLE_CHUNK_SIZE;
+ hem = table->hem[idx / table->table_chunk_size];
+ dma_offset = offset = idx % table->table_chunk_size;
} else {
hns_roce_calc_hem_mhop(hr_dev, table, &mhop_obj, &mhop);
/* mtt mhop */
@@ -815,7 +812,7 @@ int hns_roce_table_get_range(struct hns_roce_dev *hr_dev,
unsigned long start, unsigned long end)
{
struct hns_roce_hem_mhop mhop;
- unsigned long inc = HNS_ROCE_TABLE_CHUNK_SIZE / table->obj_size;
+ unsigned long inc = table->table_chunk_size / table->obj_size;
unsigned long i;
int ret;
@@ -846,7 +843,7 @@ void hns_roce_table_put_range(struct hns_roce_dev *hr_dev,
unsigned long start, unsigned long end)
{
struct hns_roce_hem_mhop mhop;
- unsigned long inc = HNS_ROCE_TABLE_CHUNK_SIZE / table->obj_size;
+ unsigned long inc = table->table_chunk_size / table->obj_size;
unsigned long i;
if (hns_roce_check_whether_mhop(hr_dev, table->type)) {
@@ -854,8 +851,7 @@ void hns_roce_table_put_range(struct hns_roce_dev *hr_dev,
inc = mhop.bt_chunk_size / table->obj_size;
}
- for (i = start; i <= end;
- i += inc)
+ for (i = start; i <= end; i += inc)
hns_roce_table_put(hr_dev, table, i);
}
@@ -869,7 +865,10 @@ int hns_roce_init_hem_table(struct hns_roce_dev *hr_dev,
unsigned long num_hem;
if (!hns_roce_check_whether_mhop(hr_dev, type)) {
- obj_per_chunk = HNS_ROCE_TABLE_CHUNK_SIZE / obj_size;
+ table->table_chunk_size = hr_dev->caps.chunk_sz;
+ table->hem_alloc_size = hr_dev->caps.chunk_sz;
+
+ obj_per_chunk = table->table_chunk_size / obj_size;
num_hem = (nobj + obj_per_chunk - 1) / obj_per_chunk;
table->hem = kcalloc(num_hem, sizeof(*table->hem), GFP_KERNEL);
@@ -1051,7 +1050,7 @@ void hns_roce_cleanup_hem_table(struct hns_roce_dev *hr_dev,
for (i = 0; i < table->num_hem; ++i)
if (table->hem[i]) {
if (hr_dev->hw->clear_hem(hr_dev, table,
- i * HNS_ROCE_TABLE_CHUNK_SIZE / table->obj_size, 0))
+ i * table->table_chunk_size / table->obj_size, 0))
dev_err(dev, "Clear HEM base address failed.\n");
hns_roce_free_hem(hr_dev, table->hem[i]);
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
index 852db18..47ff1c9 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
@@ -1513,6 +1513,7 @@ int hns_roce_v1_profile(struct hns_roce_dev *hr_dev)
caps->reserved_mrws = 1;
caps->reserved_uars = 0;
caps->reserved_cqs = 0;
+ caps->chunk_sz = HNS_ROCE_V1_TABLE_CHUNK_SIZE;
for (i = 0; i < caps->num_ports; i++)
caps->pkey_table_len[i] = 1;
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.h b/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
index eb83ff3..21a07ef 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
@@ -72,6 +72,8 @@
#define HNS_ROCE_V1_CQE_ENTRY_SIZE 32
#define HNS_ROCE_V1_PAGE_SIZE_SUPPORT 0xFFFFF000
+#define HNS_ROCE_V1_TABLE_CHUNK_SIZE (1 << 17)
+
#define HNS_ROCE_V1_EXT_RAQ_WF 8
#define HNS_ROCE_V1_RAQ_ENTRY 64
#define HNS_ROCE_V1_RAQ_DEPTH 32768
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index 9e19bf1..5a011da 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -943,6 +943,7 @@ static int hns_roce_v2_profile(struct hns_roce_dev *hr_dev)
caps->cqe_ba_pg_sz = 0;
caps->cqe_buf_pg_sz = 0;
caps->cqe_hop_num = HNS_ROCE_CQE_HOP_NUM;
+ caps->chunk_sz = HNS_ROCE_V2_TABLE_CHUNK_SIZE;
caps->pkey_table_len[0] = 1;
caps->gid_table_len[0] = 2;
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
index 4fc4acd..65ed3f8 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
@@ -78,6 +78,8 @@
#define HNS_ROCE_CQE_HOP_NUM 1
#define HNS_ROCE_PBL_HOP_NUM 2
+#define HNS_ROCE_V2_TABLE_CHUNK_SIZE (1 << 18)
+
#define HNS_ROCE_CMD_FLAG_IN_VALID_SHIFT 0
#define HNS_ROCE_CMD_FLAG_OUT_VALID_SHIFT 1
#define HNS_ROCE_CMD_FLAG_NEXT_SHIFT 2
--
1.9.1
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH for-next 4/4] RDMA/hns: Update the PD&CQE&MTT specification in hip08
2017-09-30 9:28 ` Wei Hu (Xavier)
@ 2017-09-30 9:29 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-09-30 9:29 UTC (permalink / raw)
To: dledford
Cc: linux-rdma, xavier.huwei, lijun_nudt, oulijun, charles.chenxin,
liuyixian, xushaobo2, zhangxiping3, xavier.huwei, linuxarm,
linux-kernel, shaobohsu, shaoboxu
This patch updates the PD specification to 16M for hip08, and
updates the numbers of mtt and cqe segments for the buddy allocator.
As the CQE supports hop num 1 addressing, this patch sets the CQE
specification to 64k.
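The new limits are plain powers of two; the hexadecimal constants in the hunk below correspond to 16M PDs and segments and 64k CQEs. The correspondence can be checked directly (values copied from the patch; this is only a sanity check, not driver code):

```c
#include <assert.h>

/* Constants as updated by the patch for hip08 (hns_roce_hw_v2.h). */
#define HNS_ROCE_V2_MAX_CQE_NUM		0x10000		/* 64k CQEs */
#define HNS_ROCE_V2_MAX_PD_NUM		0x1000000	/* 16M PDs */
#define HNS_ROCE_V2_MAX_MTT_SEGS	0x1000000	/* 16M mtt segments */
#define HNS_ROCE_V2_MAX_CQE_SEGS	0x1000000	/* 16M cqe segments */
```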
Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
---
drivers/infiniband/hw/hns/hns_roce_hw_v2.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
index 65ed3f8..6106ad1 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h
@@ -47,16 +47,16 @@
#define HNS_ROCE_V2_MAX_QP_NUM 0x2000
#define HNS_ROCE_V2_MAX_WQE_NUM 0x8000
#define HNS_ROCE_V2_MAX_CQ_NUM 0x8000
-#define HNS_ROCE_V2_MAX_CQE_NUM 0x400000
+#define HNS_ROCE_V2_MAX_CQE_NUM 0x10000
#define HNS_ROCE_V2_MAX_RQ_SGE_NUM 0x100
#define HNS_ROCE_V2_MAX_SQ_SGE_NUM 0xff
#define HNS_ROCE_V2_MAX_SQ_INLINE 0x20
#define HNS_ROCE_V2_UAR_NUM 256
#define HNS_ROCE_V2_PHY_UAR_NUM 1
#define HNS_ROCE_V2_MAX_MTPT_NUM 0x8000
-#define HNS_ROCE_V2_MAX_MTT_SEGS 0x100000
-#define HNS_ROCE_V2_MAX_CQE_SEGS 0x10000
-#define HNS_ROCE_V2_MAX_PD_NUM 0x400000
+#define HNS_ROCE_V2_MAX_MTT_SEGS 0x1000000
+#define HNS_ROCE_V2_MAX_CQE_SEGS 0x1000000
+#define HNS_ROCE_V2_MAX_PD_NUM 0x1000000
#define HNS_ROCE_V2_MAX_QP_INIT_RDMA 128
#define HNS_ROCE_V2_MAX_QP_DEST_RDMA 128
#define HNS_ROCE_V2_MAX_SQ_DESC_SZ 64
--
1.9.1
^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-09-30 9:28 ` Wei Hu (Xavier)
@ 2017-09-30 16:10 ` Leon Romanovsky
-1 siblings, 0 replies; 57+ messages in thread
From: Leon Romanovsky @ 2017-09-30 16:10 UTC (permalink / raw)
To: Wei Hu (Xavier)
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, lijun_nudt-9Onoh4P/yGk,
oulijun-hv44wF8Li93QT0dZR+AlfA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
xushaobo2-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, xavier.huwei-WVlzvzqoTvw,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, shaobohsu-9Onoh4P/yGk,
shaoboxu-WVlzvzqoTvw
[-- Attachment #1: Type: text/plain, Size: 6802 bytes --]
On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
> If the IOMMU is enabled, the length of an sg entry obtained from
> __iommu_map_sg_attrs is not 4KB. When the IOVA is set with the sg
> dma address, the IOVA will not be page-contiguous, and the VA
> returned from dma_alloc_coherent is a vmalloc address. However,
> the VA obtained by page_address is a discontiguous VA. Under
> these circumstances, the IOVA should be calculated based on the
> sg length, and the VA returned from dma_alloc_coherent recorded
> in the hem struct.
>
> Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Shaobo Xu <xushaobo2-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> ---
Doug,
I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
driver code to deal with dma_alloc_coherent is most probably wrong.
Thanks
> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
> drivers/infiniband/hw/hns/hns_roce_hem.c | 30 +++++++++++++++++++++++++++---
> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
> 4 files changed, 52 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c b/drivers/infiniband/hw/hns/hns_roce_alloc.c
> index 3e4c525..a69cd4b 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev *hr_dev, u32 size, u32 max_direct,
> goto err_free;
>
> for (i = 0; i < buf->nbufs; ++i)
> - pages[i] = virt_to_page(buf->page_list[i].buf);
> + pages[i] =
> + is_vmalloc_addr(buf->page_list[i].buf) ?
> + vmalloc_to_page(buf->page_list[i].buf) :
> + virt_to_page(buf->page_list[i].buf);
>
> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
> PAGE_KERNEL);
> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c b/drivers/infiniband/hw/hns/hns_roce_hem.c
> index 8388ae2..4a3d1d4 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
> @@ -200,6 +200,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> gfp_t gfp_mask)
> {
> struct hns_roce_hem_chunk *chunk = NULL;
> + struct hns_roce_vmalloc *vmalloc;
> struct hns_roce_hem *hem;
> struct scatterlist *mem;
> int order;
> @@ -227,6 +228,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
> chunk->npages = 0;
> chunk->nsg = 0;
> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
> list_add_tail(&chunk->list, &hem->chunk_list);
> }
>
> @@ -243,7 +245,15 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> if (!buf)
> goto fail;
>
> - sg_set_buf(mem, buf, PAGE_SIZE << order);
> + if (is_vmalloc_addr(buf)) {
> + vmalloc = &chunk->vmalloc[chunk->npages];
> + vmalloc->is_vmalloc_addr = true;
> + vmalloc->vmalloc_addr = buf;
> + sg_set_page(mem, vmalloc_to_page(buf),
> + PAGE_SIZE << order, offset_in_page(buf));
> + } else {
> + sg_set_buf(mem, buf, PAGE_SIZE << order);
> + }
> WARN_ON(mem->offset);
> sg_dma_len(mem) = PAGE_SIZE << order;
>
> @@ -262,17 +272,25 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct hns_roce_hem *hem)
> {
> struct hns_roce_hem_chunk *chunk, *tmp;
> + void *cpu_addr;
> int i;
>
> if (!hem)
> return;
>
> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
> - for (i = 0; i < chunk->npages; ++i)
> + for (i = 0; i < chunk->npages; ++i) {
> + if (chunk->vmalloc[i].is_vmalloc_addr)
> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
> + else
> + cpu_addr =
> + lowmem_page_address(sg_page(&chunk->mem[i]));
> +
> dma_free_coherent(hr_dev->dev,
> chunk->mem[i].length,
> - lowmem_page_address(sg_page(&chunk->mem[i])),
> + cpu_addr,
> sg_dma_address(&chunk->mem[i]));
> + }
> kfree(chunk);
> }
>
> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev *hr_dev,
>
> if (chunk->mem[i].length > (u32)offset) {
> page = sg_page(&chunk->mem[i]);
> + if (chunk->vmalloc[i].is_vmalloc_addr) {
> + mutex_unlock(&table->mutex);
> + return page ?
> + chunk->vmalloc[i].vmalloc_addr
> + + offset : NULL;
> + }
> goto out;
> }
> offset -= chunk->mem[i].length;
> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h b/drivers/infiniband/hw/hns/hns_roce_hem.h
> index af28bbf..62d712a 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
> @@ -72,11 +72,17 @@ enum {
> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
> };
>
> +struct hns_roce_vmalloc {
> + bool is_vmalloc_addr;
> + void *vmalloc_addr;
> +};
> +
> struct hns_roce_hem_chunk {
> struct list_head list;
> int npages;
> int nsg;
> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
> };
>
> struct hns_roce_hem {
> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> index b99d70a..9e19bf1 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
> {
> struct hns_roce_v2_mpt_entry *mpt_entry;
> struct scatterlist *sg;
> + u64 page_addr = 0;
> u64 *pages;
> + int i = 0, j = 0;
> + int len = 0;
> int entry;
> - int i;
>
> mpt_entry = mb_buf;
> memset(mpt_entry, 0, sizeof(*mpt_entry));
> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
>
> i = 0;
> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
> -
> - /* Record the first 2 entry directly to MTPT table */
> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
> - break;
> - i++;
> + len = sg_dma_len(sg) >> PAGE_SHIFT;
> + for (j = 0; j < len; ++j) {
> + page_addr = sg_dma_address(sg) +
> + (j << mr->umem->page_shift);
> + pages[i] = page_addr >> 6;
> +
> + /* Record the first 2 entry directly to MTPT table */
> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
> + goto found;
> + i++;
> + }
> }
>
> +found:
> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
> V2_MPT_BYTE_56_PA0_H_S,
> --
> 1.9.1
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 3/4] RDMA/hns: Update the IRRL table chunk size in hip08
2017-09-30 9:29 ` Wei Hu (Xavier)
@ 2017-10-01 5:40 ` Leon Romanovsky
-1 siblings, 0 replies; 57+ messages in thread
From: Leon Romanovsky @ 2017-10-01 5:40 UTC (permalink / raw)
To: Wei Hu (Xavier)
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, lijun_nudt-9Onoh4P/yGk,
oulijun-hv44wF8Li93QT0dZR+AlfA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
xushaobo2-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, xavier.huwei-WVlzvzqoTvw,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, shaobohsu-9Onoh4P/yGk,
shaoboxu-WVlzvzqoTvw
[-- Attachment #1: Type: text/plain, Size: 1912 bytes --]
On Sat, Sep 30, 2017 at 05:29:00PM +0800, Wei Hu (Xavier) wrote:
> With the increase of the IRRL specification in hip08, the IRRL
> table chunk size needs to be updated.
> This patch updates the IRRL table chunk size to 256K for hip08.
>
> Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Shaobo Xu <xushaobo2-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> ---
> drivers/infiniband/hw/hns/hns_roce_device.h | 3 +++
> drivers/infiniband/hw/hns/hns_roce_hem.c | 31 ++++++++++++++---------------
> drivers/infiniband/hw/hns/hns_roce_hw_v1.c | 1 +
> drivers/infiniband/hw/hns/hns_roce_hw_v1.h | 2 ++
> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 1 +
> drivers/infiniband/hw/hns/hns_roce_hw_v2.h | 2 ++
> 6 files changed, 24 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
> index 9353400..fc2a53d 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_device.h
> +++ b/drivers/infiniband/hw/hns/hns_roce_device.h
> @@ -236,6 +236,8 @@ struct hns_roce_hem_table {
> unsigned long num_obj;
> /*Single obj size */
> unsigned long obj_size;
> + unsigned long table_chunk_size;
> + unsigned long hem_alloc_size;
> int lowmem;
> struct mutex mutex;
> struct hns_roce_hem **hem;
> @@ -565,6 +567,7 @@ struct hns_roce_caps {
> u32 cqe_ba_pg_sz;
> u32 cqe_buf_pg_sz;
> u32 cqe_hop_num;
> + u32 chunk_sz; /* chunk size in non multihop mode*/
> };
Hi,
I have two comments:
1. In this code, table_chunk_size is equal and similar to hem_alloc_size.
Please don't introduce unneeded complexity.
2. The size of the table is num_obj * obj_size; there is no need for
table_chunk_size and hem_alloc_size at all. There are plenty of macros in
the kernel to deal with such tables.
Thanks
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-09-30 16:10 ` Leon Romanovsky
@ 2017-10-12 12:31 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-10-12 12:31 UTC (permalink / raw)
To: Leon Romanovsky
Cc: dledford, linux-rdma, lijun_nudt, oulijun, charles.chenxin,
liuyixian, linux-mm, zhangxiping3, xavier.huwei, linuxarm,
linux-kernel, shaobo.xu, shaoboxu, leizhen 00275356, joro, iommu
On 2017/10/1 0:10, Leon Romanovsky wrote:
> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>> If the IOMMU is enabled, the length of an sg entry obtained from
>> __iommu_map_sg_attrs is not 4KB. When the IOVA is set with the sg
>> dma address, the IOVA will not be page-contiguous, and the VA
>> returned from dma_alloc_coherent is a vmalloc address. However,
>> the VA obtained by page_address is a discontiguous VA. Under
>> these circumstances, the IOVA should be calculated based on the
>> sg length, and the VA returned from dma_alloc_coherent recorded
>> in the hem struct.
>>
>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>> ---
> Doug,
>
> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
> driver code to deal with dma_alloc_coherent is most probably wrong.
>
> Thanks
Hi, Leon & Doug
We referred to the function __ttm_dma_alloc_page in the kernel
code, shown below.
There are similar methods in the bch_bio_map and mem_to_page
functions in the current 4.14-rcX.
static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
{
	struct dma_page *d_page;

	d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
	if (!d_page)
		return NULL;

	d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
					   &d_page->dma,
					   pool->gfp_flags);
	if (d_page->vaddr) {
		if (is_vmalloc_addr(d_page->vaddr))
			d_page->p = vmalloc_to_page(d_page->vaddr);
		else
			d_page->p = virt_to_page(d_page->vaddr);
	} else {
		kfree(d_page);
		d_page = NULL;
	}
	return d_page;
}
Regards
Wei Hu
>
>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30 +++++++++++++++++++++++++++---
>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>> index 3e4c525..a69cd4b 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev *hr_dev, u32 size, u32 max_direct,
>> goto err_free;
>>
>> for (i = 0; i < buf->nbufs; ++i)
>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>> + pages[i] =
>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>> + vmalloc_to_page(buf->page_list[i].buf) :
>> + virt_to_page(buf->page_list[i].buf);
>>
>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>> PAGE_KERNEL);
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c b/drivers/infiniband/hw/hns/hns_roce_hem.c
>> index 8388ae2..4a3d1d4 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>> @@ -200,6 +200,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> gfp_t gfp_mask)
>> {
>> struct hns_roce_hem_chunk *chunk = NULL;
>> + struct hns_roce_vmalloc *vmalloc;
>> struct hns_roce_hem *hem;
>> struct scatterlist *mem;
>> int order;
>> @@ -227,6 +228,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>> chunk->npages = 0;
>> chunk->nsg = 0;
>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>> list_add_tail(&chunk->list, &hem->chunk_list);
>> }
>>
>> @@ -243,7 +245,15 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> if (!buf)
>> goto fail;
>>
>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>> + if (is_vmalloc_addr(buf)) {
>> + vmalloc = &chunk->vmalloc[chunk->npages];
>> + vmalloc->is_vmalloc_addr = true;
>> + vmalloc->vmalloc_addr = buf;
>> + sg_set_page(mem, vmalloc_to_page(buf),
>> + PAGE_SIZE << order, offset_in_page(buf));
>> + } else {
>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>> + }
>> WARN_ON(mem->offset);
>> sg_dma_len(mem) = PAGE_SIZE << order;
>>
>> @@ -262,17 +272,25 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct hns_roce_hem *hem)
>> {
>> struct hns_roce_hem_chunk *chunk, *tmp;
>> + void *cpu_addr;
>> int i;
>>
>> if (!hem)
>> return;
>>
>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>> - for (i = 0; i < chunk->npages; ++i)
>> + for (i = 0; i < chunk->npages; ++i) {
>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>> + else
>> + cpu_addr =
>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>> +
>> dma_free_coherent(hr_dev->dev,
>> chunk->mem[i].length,
>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>> + cpu_addr,
>> sg_dma_address(&chunk->mem[i]));
>> + }
>> kfree(chunk);
>> }
>>
>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev *hr_dev,
>>
>> if (chunk->mem[i].length > (u32)offset) {
>> page = sg_page(&chunk->mem[i]);
>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>> + mutex_unlock(&table->mutex);
>> + return page ?
>> + chunk->vmalloc[i].vmalloc_addr
>> + + offset : NULL;
>> + }
>> goto out;
>> }
>> offset -= chunk->mem[i].length;
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h b/drivers/infiniband/hw/hns/hns_roce_hem.h
>> index af28bbf..62d712a 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>> @@ -72,11 +72,17 @@ enum {
>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>> };
>>
>> +struct hns_roce_vmalloc {
>> + bool is_vmalloc_addr;
>> + void *vmalloc_addr;
>> +};
>> +
>> struct hns_roce_hem_chunk {
>> struct list_head list;
>> int npages;
>> int nsg;
>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>> };
>>
>> struct hns_roce_hem {
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> index b99d70a..9e19bf1 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
>> {
>> struct hns_roce_v2_mpt_entry *mpt_entry;
>> struct scatterlist *sg;
>> + u64 page_addr = 0;
>> u64 *pages;
>> + int i = 0, j = 0;
>> + int len = 0;
>> int entry;
>> - int i;
>>
>> mpt_entry = mb_buf;
>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
>>
>> i = 0;
>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>> -
>> - /* Record the first 2 entry directly to MTPT table */
>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>> - break;
>> - i++;
>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>> + for (j = 0; j < len; ++j) {
>> + page_addr = sg_dma_address(sg) +
>> + (j << mr->umem->page_shift);
>> + pages[i] = page_addr >> 6;
>> +
>> + /* Record the first 2 entry directly to MTPT table */
>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>> + goto found;
>> + i++;
>> + }
>> }
>>
>> +found:
>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>> V2_MPT_BYTE_56_PA0_H_S,
>> --
>> 1.9.1
>>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
@ 2017-10-12 12:31 ` Wei Hu (Xavier)
0 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-10-12 12:31 UTC (permalink / raw)
To: Leon Romanovsky
Cc: dledford, linux-rdma, lijun_nudt, oulijun, charles.chenxin,
liuyixian, linux-mm, zhangxiping3, xavier.huwei, linuxarm,
linux-kernel, shaobo.xu, shaoboxu, leizhen 00275356, joro, iommu
On 2017/10/1 0:10, Leon Romanovsky wrote:
> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>> If the IOMMU is enabled, the length of sg obtained from
>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>> dma address, the IOVA will not be page continuous. and the VA
>> returned from dma_alloc_coherent is a vmalloc address. However,
>> the VA obtained by the page_address is a discontinuous VA. Under
>> these circumstances, the IOVA should be calculated based on the
>> sg length, and record the VA returned from dma_alloc_coherent
>> in the struct of hem.
>>
>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>> ---
> Doug,
>
> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
> driver code to deal with dma_alloc_coherent is most probably wrong.
>
> Thanks
Hi, Leon & Doug
We referred to the function __ttm_dma_alloc_page in the kernel
code, quoted below.
There are similar approaches in the bch_bio_map and mem_to_page
functions in current 4.14-rcx.
static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
{
        struct dma_page *d_page;

        d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
        if (!d_page)
                return NULL;

        d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
                                           &d_page->dma,
                                           pool->gfp_flags);
        if (d_page->vaddr) {
                if (is_vmalloc_addr(d_page->vaddr))
                        d_page->p = vmalloc_to_page(d_page->vaddr);
                else
                        d_page->p = virt_to_page(d_page->vaddr);
        } else {
                kfree(d_page);
                d_page = NULL;
        }
        return d_page;
}
Regards
Wei Hu
>
>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30 +++++++++++++++++++++++++++---
>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>> index 3e4c525..a69cd4b 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev *hr_dev, u32 size, u32 max_direct,
>> goto err_free;
>>
>> for (i = 0; i < buf->nbufs; ++i)
>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>> + pages[i] =
>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>> + vmalloc_to_page(buf->page_list[i].buf) :
>> + virt_to_page(buf->page_list[i].buf);
>>
>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>> PAGE_KERNEL);
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c b/drivers/infiniband/hw/hns/hns_roce_hem.c
>> index 8388ae2..4a3d1d4 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>> @@ -200,6 +200,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> gfp_t gfp_mask)
>> {
>> struct hns_roce_hem_chunk *chunk = NULL;
>> + struct hns_roce_vmalloc *vmalloc;
>> struct hns_roce_hem *hem;
>> struct scatterlist *mem;
>> int order;
>> @@ -227,6 +228,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>> chunk->npages = 0;
>> chunk->nsg = 0;
>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>> list_add_tail(&chunk->list, &hem->chunk_list);
>> }
>>
>> @@ -243,7 +245,15 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> if (!buf)
>> goto fail;
>>
>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>> + if (is_vmalloc_addr(buf)) {
>> + vmalloc = &chunk->vmalloc[chunk->npages];
>> + vmalloc->is_vmalloc_addr = true;
>> + vmalloc->vmalloc_addr = buf;
>> + sg_set_page(mem, vmalloc_to_page(buf),
>> + PAGE_SIZE << order, offset_in_page(buf));
>> + } else {
>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>> + }
>> WARN_ON(mem->offset);
>> sg_dma_len(mem) = PAGE_SIZE << order;
>>
>> @@ -262,17 +272,25 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct hns_roce_hem *hem)
>> {
>> struct hns_roce_hem_chunk *chunk, *tmp;
>> + void *cpu_addr;
>> int i;
>>
>> if (!hem)
>> return;
>>
>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>> - for (i = 0; i < chunk->npages; ++i)
>> + for (i = 0; i < chunk->npages; ++i) {
>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>> + else
>> + cpu_addr =
>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>> +
>> dma_free_coherent(hr_dev->dev,
>> chunk->mem[i].length,
>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>> + cpu_addr,
>> sg_dma_address(&chunk->mem[i]));
>> + }
>> kfree(chunk);
>> }
>>
>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev *hr_dev,
>>
>> if (chunk->mem[i].length > (u32)offset) {
>> page = sg_page(&chunk->mem[i]);
>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>> + mutex_unlock(&table->mutex);
>> + return page ?
>> + chunk->vmalloc[i].vmalloc_addr
>> + + offset : NULL;
>> + }
>> goto out;
>> }
>> offset -= chunk->mem[i].length;
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h b/drivers/infiniband/hw/hns/hns_roce_hem.h
>> index af28bbf..62d712a 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>> @@ -72,11 +72,17 @@ enum {
>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>> };
>>
>> +struct hns_roce_vmalloc {
>> + bool is_vmalloc_addr;
>> + void *vmalloc_addr;
>> +};
>> +
>> struct hns_roce_hem_chunk {
>> struct list_head list;
>> int npages;
>> int nsg;
>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>> };
>>
>> struct hns_roce_hem {
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> index b99d70a..9e19bf1 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
>> {
>> struct hns_roce_v2_mpt_entry *mpt_entry;
>> struct scatterlist *sg;
>> + u64 page_addr = 0;
>> u64 *pages;
>> + int i = 0, j = 0;
>> + int len = 0;
>> int entry;
>> - int i;
>>
>> mpt_entry = mb_buf;
>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
>>
>> i = 0;
>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>> -
>> - /* Record the first 2 entry directly to MTPT table */
>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>> - break;
>> - i++;
>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>> + for (j = 0; j < len; ++j) {
>> + page_addr = sg_dma_address(sg) +
>> + (j << mr->umem->page_shift);
>> + pages[i] = page_addr >> 6;
>> +
>> + /* Record the first 2 entry directly to MTPT table */
>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>> + goto found;
>> + i++;
>> + }
>> }
>>
>> +found:
>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>> V2_MPT_BYTE_56_PA0_H_S,
>> --
>> 1.9.1
>>
^ permalink raw reply [flat|nested] 57+ messages in thread
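[Editor's note] The hns_roce_hw_v2.c hunk above changes the MTPT fill loop to walk each sg element in page-sized steps, instead of taking one entry per element, because an IOMMU can merge several 4 KB pages into one sg segment. A self-contained userspace sketch of that loop (names like fill_mtpt_pages and MAX_INNER_MTPT_NUM are illustrative stand-ins, not driver API; the >> 6 shift and the early-out on full inner-MTPT slots mirror the patch):

```c
#include <stdint.h>
#include <assert.h>

#define PAGE_SHIFT 12
#define MAX_INNER_MTPT_NUM 2  /* stands in for HNS_ROCE_V2_MAX_INNER_MTPT_NUM */

struct sg_seg {
        uint64_t dma_addr;
        uint64_t dma_len;
};

/* Walk every 4 KB page inside each (possibly IOMMU-merged) sg segment,
 * recording page_addr >> 6 as the hardware expects, and stop once the
 * inner-MTPT slots are full -- the behaviour the patch adds. */
static int fill_mtpt_pages(const struct sg_seg *sgl, int nsg,
                           uint64_t *pages, int max)
{
        int i = 0;

        for (int e = 0; e < nsg; e++) {
                uint64_t npages = sgl[e].dma_len >> PAGE_SHIFT;

                for (uint64_t j = 0; j < npages; j++) {
                        pages[i] = (sgl[e].dma_addr + (j << PAGE_SHIFT)) >> 6;
                        if (i >= max - 1)
                                return i + 1;  /* the "found" label in the patch */
                        i++;
                }
        }
        return i;
}
```

With one 8 KB merged segment this records two page entries, where the old per-segment loop would have recorded only one.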
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-10-12 12:31 ` Wei Hu (Xavier)
(?)
@ 2017-10-12 12:59 ` Robin Murphy
-1 siblings, 0 replies; 57+ messages in thread
From: Robin Murphy @ 2017-10-12 12:59 UTC (permalink / raw)
To: Wei Hu (Xavier), Leon Romanovsky
Cc: shaobo.xu-ral2JQCrhuEAvxtiuMwx3w, xavier.huwei-WVlzvzqoTvw,
lijun_nudt-9Onoh4P/yGk, oulijun-hv44wF8Li93QT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, dledford-H+wXaHxf7aLQT0dZR+AlfA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, shaoboxu-WVlzvzqoTvw
On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>
>
> On 2017/10/1 0:10, Leon Romanovsky wrote:
>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>> If the IOMMU is enabled, the length of sg obtained from
>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>> dma address, the IOVA will not be page continuous. and the VA
>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>> the VA obtained by the page_address is a discontinuous VA. Under
>>> these circumstances, the IOVA should be calculated based on the
>>> sg length, and record the VA returned from dma_alloc_coherent
>>> in the struct of hem.
>>>
>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>> Signed-off-by: Shaobo Xu <xushaobo2-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>> Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>> ---
>> Doug,
>>
>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>
>> Thanks
> Hi, Leon & Doug
> We refered the function named __ttm_dma_alloc_page in the kernel
> code as below:
> And there are similar methods in bch_bio_map and mem_to_page
> functions in current 4.14-rcx.
>
> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
> {
> struct dma_page *d_page;
>
> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
> if (!d_page)
> return NULL;
>
> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
> &d_page->dma,
> pool->gfp_flags);
> if (d_page->vaddr) {
> if (is_vmalloc_addr(d_page->vaddr))
> d_page->p = vmalloc_to_page(d_page->vaddr);
> else
> d_page->p = virt_to_page(d_page->vaddr);
There are cases on various architectures where neither of those is
right. Whether those actually intersect with TTM or RDMA use-cases is
another matter, of course.
What definitely is a problem is if you ever take that page and end up
accessing it through any virtual address other than the one explicitly
returned by dma_alloc_coherent(). That can blow the coherency wide open
and invite data loss, right up to killing the whole system with a
machine check on certain architectures.
Robin.
> } else {
> kfree(d_page);
> d_page = NULL;
> }
> return d_page;
> }
>
> Regards
> Wei Hu
>>
>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>> +++++++++++++++++++++++++++---
>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> index 3e4c525..a69cd4b 100644
>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>> *hr_dev, u32 size, u32 max_direct,
>>> goto err_free;
>>>
>>> for (i = 0; i < buf->nbufs; ++i)
>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>> + pages[i] =
>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>> + virt_to_page(buf->page_list[i].buf);
>>>
>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>> PAGE_KERNEL);
>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>> index 8388ae2..4a3d1d4 100644
>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>> gfp_t gfp_mask)
>>> {
>>> struct hns_roce_hem_chunk *chunk = NULL;
>>> + struct hns_roce_vmalloc *vmalloc;
>>> struct hns_roce_hem *hem;
>>> struct scatterlist *mem;
>>> int order;
>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>> chunk->npages = 0;
>>> chunk->nsg = 0;
>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>> }
>>>
>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>> if (!buf)
>>> goto fail;
>>>
>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>> + if (is_vmalloc_addr(buf)) {
>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>> + vmalloc->is_vmalloc_addr = true;
>>> + vmalloc->vmalloc_addr = buf;
>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>> + PAGE_SIZE << order, offset_in_page(buf));
>>> + } else {
>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>> + }
>>> WARN_ON(mem->offset);
>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>
>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>> hns_roce_hem *hem)
>>> {
>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>> + void *cpu_addr;
>>> int i;
>>>
>>> if (!hem)
>>> return;
>>>
>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>> - for (i = 0; i < chunk->npages; ++i)
>>> + for (i = 0; i < chunk->npages; ++i) {
>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>> + else
>>> + cpu_addr =
>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>> +
>>> dma_free_coherent(hr_dev->dev,
>>> chunk->mem[i].length,
>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>> + cpu_addr,
>>> sg_dma_address(&chunk->mem[i]));
>>> + }
>>> kfree(chunk);
>>> }
>>>
>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>> *hr_dev,
>>>
>>> if (chunk->mem[i].length > (u32)offset) {
>>> page = sg_page(&chunk->mem[i]);
>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>> + mutex_unlock(&table->mutex);
>>> + return page ?
>>> + chunk->vmalloc[i].vmalloc_addr
>>> + + offset : NULL;
>>> + }
>>> goto out;
>>> }
>>> offset -= chunk->mem[i].length;
>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>> index af28bbf..62d712a 100644
>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>> @@ -72,11 +72,17 @@ enum {
>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>> };
>>>
>>> +struct hns_roce_vmalloc {
>>> + bool is_vmalloc_addr;
>>> + void *vmalloc_addr;
>>> +};
>>> +
>>> struct hns_roce_hem_chunk {
>>> struct list_head list;
>>> int npages;
>>> int nsg;
>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>> };
>>>
>>> struct hns_roce_hem {
>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> index b99d70a..9e19bf1 100644
>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>> *mb_buf, struct hns_roce_mr *mr,
>>> {
>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>> struct scatterlist *sg;
>>> + u64 page_addr = 0;
>>> u64 *pages;
>>> + int i = 0, j = 0;
>>> + int len = 0;
>>> int entry;
>>> - int i;
>>>
>>> mpt_entry = mb_buf;
>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>> *mb_buf, struct hns_roce_mr *mr,
>>>
>>> i = 0;
>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>> -
>>> - /* Record the first 2 entry directly to MTPT table */
>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>> - break;
>>> - i++;
>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>> + for (j = 0; j < len; ++j) {
>>> + page_addr = sg_dma_address(sg) +
>>> + (j << mr->umem->page_shift);
>>> + pages[i] = page_addr >> 6;
>>> +
>>> + /* Record the first 2 entry directly to MTPT table */
>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>> + goto found;
>>> + i++;
>>> + }
>>> }
>>>
>>> +found:
>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>> V2_MPT_BYTE_56_PA0_H_S,
>>> --
>>> 1.9.1
>>>
>
>
> _______________________________________________
> iommu mailing list
> iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
^ permalink raw reply [flat|nested] 57+ messages in thread
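[Editor's note] Robin's rule — only ever touch coherent memory through the exact VA that dma_alloc_coherent() returned — is what the hem changes implement by recording that VA. A minimal userspace sketch of the bookkeeping, with malloc standing in for dma_alloc_coherent (all names here are illustrative, not driver API):

```c
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>

/* Each buffer keeps the one VA handed out by the allocator; every CPU
 * access and the eventual free go through that recorded pointer, never
 * through a second mapping derived from the underlying page. */
struct coherent_buf {
        void  *cpu_addr;  /* exactly what the allocator returned */
        size_t size;
};

static int coherent_buf_alloc(struct coherent_buf *b, size_t size)
{
        b->cpu_addr = malloc(size);  /* dma_alloc_coherent() in the driver */
        b->size = b->cpu_addr ? size : 0;
        return b->cpu_addr ? 0 : -1;
}

static void coherent_buf_free(struct coherent_buf *b)
{
        free(b->cpu_addr);           /* same VA: dma_free_coherent() analogue */
        b->cpu_addr = NULL;
        b->size = 0;
}
```

In the patch this role is played by the vmalloc_addr field recorded in struct hns_roce_hem_chunk, which hns_roce_free_hem and hns_roce_table_find then use instead of re-deriving a VA via page_address().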
>>> A A A A A struct list_headA A A A list;
>>> A A A A A intA A A A A A A A A A A A npages;
>>> A A A A A intA A A A A A A A A A A A nsg;
>>> A A A A A struct scatterlistA A A A mem[HNS_ROCE_HEM_CHUNK_LEN];
>>> +A A A struct hns_roce_vmallocA A A A vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>> A };
>>>
>>> A struct hns_roce_hem {
>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> index b99d70a..9e19bf1 100644
>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>> *mb_buf, struct hns_roce_mr *mr,
>>> A {
>>> A A A A A struct hns_roce_v2_mpt_entry *mpt_entry;
>>> A A A A A struct scatterlist *sg;
>>> +A A A u64 page_addr = 0;
>>> A A A A A u64 *pages;
>>> +A A A int i = 0, j = 0;
>>> +A A A int len = 0;
>>> A A A A A int entry;
>>> -A A A int i;
>>>
>>> A A A A A mpt_entry = mb_buf;
>>> A A A A A memset(mpt_entry, 0, sizeof(*mpt_entry));
>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>> *mb_buf, struct hns_roce_mr *mr,
>>>
>>> A A A A A i = 0;
>>> A A A A A for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>> -A A A A A A A pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>> -
>>> -A A A A A A A /* Record the first 2 entry directly to MTPT table */
>>> -A A A A A A A if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>> -A A A A A A A A A A A break;
>>> -A A A A A A A i++;
>>> +A A A A A A A len = sg_dma_len(sg) >> PAGE_SHIFT;
>>> +A A A A A A A for (j = 0; j < len; ++j) {
>>> +A A A A A A A A A A A page_addr = sg_dma_address(sg) +
>>> +A A A A A A A A A A A A A A A A A A A (j << mr->umem->page_shift);
>>> +A A A A A A A A A A A pages[i] = page_addr >> 6;
>>> +
>>> +A A A A A A A A A A A /* Record the first 2 entry directly to MTPT table */
>>> +A A A A A A A A A A A if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>> +A A A A A A A A A A A A A A A goto found;
>>> +A A A A A A A A A A A i++;
>>> +A A A A A A A }
>>> A A A A A }
>>>
>>> +found:
>>> A A A A A mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>> A A A A A roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>> A A A A A A A A A A A A A A A A V2_MPT_BYTE_56_PA0_H_S,
>>> --
>>> 1.9.1
>>>
>
>
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-10-12 12:31 ` Wei Hu (Xavier)
@ 2017-10-12 14:54 ` Leon Romanovsky
-1 siblings, 0 replies; 57+ messages in thread
From: Leon Romanovsky @ 2017-10-12 14:54 UTC (permalink / raw)
To: Wei Hu (Xavier)
Cc: dledford, linux-rdma, lijun_nudt, oulijun, charles.chenxin,
liuyixian, linux-mm, zhangxiping3, xavier.huwei, linuxarm,
linux-kernel, shaobo.xu, shaoboxu, leizhen 00275356, joro, iommu
On Thu, Oct 12, 2017 at 08:31:31PM +0800, Wei Hu (Xavier) wrote:
>
>
> On 2017/10/1 0:10, Leon Romanovsky wrote:
> > On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
> > > If the IOMMU is enabled, the length of the sg obtained from
> > > __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
> > > dma address, the IOVA will not be page contiguous, and the VA
> > > returned from dma_alloc_coherent is a vmalloc address. However,
> > > the VA obtained by page_address is a discontiguous VA. Under
> > > these circumstances, the IOVA should be calculated based on the
> > > sg length, and the VA returned from dma_alloc_coherent should be
> > > recorded in the hem struct.
> > >
> > > Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
> > > Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
> > > Signed-off-by: Lijun Ou <oulijun@huawei.com>
> > > ---
> > Doug,
> >
> > I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
> > driver code to deal with dma_alloc_coherent is most probably wrong.
> >
> > Thanks
> Hi, Leon & Doug
> We referred to the function named __ttm_dma_alloc_page in the kernel
> code, as below:
> And there are similar methods in the bch_bio_map and mem_to_page
> functions in the current 4.14-rcx.
Let's put aside TTM; I don't know the rationale behind their implementation,
but both mem_to_page and bch_bio_map don't operate on DMA addresses
and don't belong in HW driver code.
Thanks
* Re: [PATCH for-next 3/4] RDMA/hns: Update the IRRL table chunk size in hip08
2017-10-01 5:40 ` Leon Romanovsky
@ 2017-10-17 11:40 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-10-17 11:40 UTC (permalink / raw)
To: Leon Romanovsky
Cc: dledford, linux-rdma, lijun_nudt, oulijun, charles.chenxin,
liuyixian, shaobo.xu, zhangxiping3, xavier.huwei, linuxarm,
linux-kernel, shaobohsu, shaoboxu
On 2017/10/1 13:40, Leon Romanovsky wrote:
> On Sat, Sep 30, 2017 at 05:29:00PM +0800, Wei Hu (Xavier) wrote:
>> With the increase of the IRRL specification in hip08, the IRRL table
>> chunk size needs to be updated.
>> This patch updates the IRRL table chunk size to 256k for hip08.
>>
>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>> ---
>> drivers/infiniband/hw/hns/hns_roce_device.h | 3 +++
>> drivers/infiniband/hw/hns/hns_roce_hem.c | 31 ++++++++++++++---------------
>> drivers/infiniband/hw/hns/hns_roce_hw_v1.c | 1 +
>> drivers/infiniband/hw/hns/hns_roce_hw_v1.h | 2 ++
>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 1 +
>> drivers/infiniband/hw/hns/hns_roce_hw_v2.h | 2 ++
>> 6 files changed, 24 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
>> index 9353400..fc2a53d 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_device.h
>> +++ b/drivers/infiniband/hw/hns/hns_roce_device.h
>> @@ -236,6 +236,8 @@ struct hns_roce_hem_table {
>> unsigned long num_obj;
>> /*Single obj size */
>> unsigned long obj_size;
>> + unsigned long table_chunk_size;
>> + unsigned long hem_alloc_size;
>> int lowmem;
>> struct mutex mutex;
>> struct hns_roce_hem **hem;
>> @@ -565,6 +567,7 @@ struct hns_roce_caps {
>> u32 cqe_ba_pg_sz;
>> u32 cqe_buf_pg_sz;
>> u32 cqe_hop_num;
>> + u32 chunk_sz; /* chunk size in non-multihop mode */
>> };
> Hi,
>
> I have two comments:
> 1. In this code, table_chunk_size is equal to, and redundant with,
> hem_alloc_size. Please don't introduce unneeded complexity.
Hi, Leon
ok, we will delete hem_alloc_size.
Thanks.
> 2. The size of table is num_obj * obj_size, there is no need to
> table_chunk_size and hem_alloc_size at all. There are plenty of macros in
> the kernel to deal with the tables.
Hi, Leon
The chunk size is limited by the hardware: it is the maximum amount the
hardware can access at a time on the hip06 and hip08 SoCs.
For example, the chunk size is 128K in hip06 and 256K in hip08.
Thanks
> Thanks
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-09-30 16:10 ` Leon Romanovsky
@ 2017-10-18 8:42 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-10-18 8:42 UTC (permalink / raw)
To: Leon Romanovsky
Cc: dledford, linux-rdma, lijun_nudt, oulijun, charles.chenxin,
liuyixian, xushaobo2, zhangxiping3, xavier.huwei, linuxarm,
linux-kernel, shaobohsu, shaoboxu
On 2017/10/1 0:10, Leon Romanovsky wrote:
> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>> If the IOMMU is enabled, the length of the sg obtained from
>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>> dma address, the IOVA will not be page contiguous, and the VA
>> returned from dma_alloc_coherent is a vmalloc address. However,
>> the VA obtained by page_address is a discontiguous VA. Under
>> these circumstances, the IOVA should be calculated based on the
>> sg length, and the VA returned from dma_alloc_coherent should be
>> recorded in the hem struct.
>>
>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>> ---
> Doug,
>
> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
> driver code to deal with dma_alloc_coherent is most probably wrong.
>
> Thanks
>
Hi, Doug
When running on the ARM64 platform, there is currently likely to be a
call trace. Our colleague will report it to the iommu mailing list and
try to solve it.
I also think the RoCE driver shouldn't be aware of the difference.
I will pull this patch out of the series and send v2.
Thanks.
Regards
Wei Hu
>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30 +++++++++++++++++++++++++++---
>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>> index 3e4c525..a69cd4b 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev *hr_dev, u32 size, u32 max_direct,
>> goto err_free;
>>
>> for (i = 0; i < buf->nbufs; ++i)
>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>> + pages[i] =
>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>> + vmalloc_to_page(buf->page_list[i].buf) :
>> + virt_to_page(buf->page_list[i].buf);
>>
>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>> PAGE_KERNEL);
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c b/drivers/infiniband/hw/hns/hns_roce_hem.c
>> index 8388ae2..4a3d1d4 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>> @@ -200,6 +200,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> gfp_t gfp_mask)
>> {
>> struct hns_roce_hem_chunk *chunk = NULL;
>> + struct hns_roce_vmalloc *vmalloc;
>> struct hns_roce_hem *hem;
>> struct scatterlist *mem;
>> int order;
>> @@ -227,6 +228,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>> chunk->npages = 0;
>> chunk->nsg = 0;
>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>> list_add_tail(&chunk->list, &hem->chunk_list);
>> }
>>
>> @@ -243,7 +245,15 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> if (!buf)
>> goto fail;
>>
>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>> + if (is_vmalloc_addr(buf)) {
>> + vmalloc = &chunk->vmalloc[chunk->npages];
>> + vmalloc->is_vmalloc_addr = true;
>> + vmalloc->vmalloc_addr = buf;
>> + sg_set_page(mem, vmalloc_to_page(buf),
>> + PAGE_SIZE << order, offset_in_page(buf));
>> + } else {
>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>> + }
>> WARN_ON(mem->offset);
>> sg_dma_len(mem) = PAGE_SIZE << order;
>>
>> @@ -262,17 +272,25 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct hns_roce_hem *hem)
>> {
>> struct hns_roce_hem_chunk *chunk, *tmp;
>> + void *cpu_addr;
>> int i;
>>
>> if (!hem)
>> return;
>>
>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>> - for (i = 0; i < chunk->npages; ++i)
>> + for (i = 0; i < chunk->npages; ++i) {
>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>> + else
>> + cpu_addr =
>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>> +
>> dma_free_coherent(hr_dev->dev,
>> chunk->mem[i].length,
>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>> + cpu_addr,
>> sg_dma_address(&chunk->mem[i]));
>> + }
>> kfree(chunk);
>> }
>>
>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev *hr_dev,
>>
>> if (chunk->mem[i].length > (u32)offset) {
>> page = sg_page(&chunk->mem[i]);
>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>> + mutex_unlock(&table->mutex);
>> + return page ?
>> + chunk->vmalloc[i].vmalloc_addr
>> + + offset : NULL;
>> + }
>> goto out;
>> }
>> offset -= chunk->mem[i].length;
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h b/drivers/infiniband/hw/hns/hns_roce_hem.h
>> index af28bbf..62d712a 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>> @@ -72,11 +72,17 @@ enum {
>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>> };
>>
>> +struct hns_roce_vmalloc {
>> + bool is_vmalloc_addr;
>> + void *vmalloc_addr;
>> +};
>> +
>> struct hns_roce_hem_chunk {
>> struct list_head list;
>> int npages;
>> int nsg;
>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>> };
>>
>> struct hns_roce_hem {
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> index b99d70a..9e19bf1 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
>> {
>> struct hns_roce_v2_mpt_entry *mpt_entry;
>> struct scatterlist *sg;
>> + u64 page_addr = 0;
>> u64 *pages;
>> + int i = 0, j = 0;
>> + int len = 0;
>> int entry;
>> - int i;
>>
>> mpt_entry = mb_buf;
>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
>>
>> i = 0;
>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>> -
>> - /* Record the first 2 entry directly to MTPT table */
>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>> - break;
>> - i++;
>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>> + for (j = 0; j < len; ++j) {
>> + page_addr = sg_dma_address(sg) +
>> + (j << mr->umem->page_shift);
>> + pages[i] = page_addr >> 6;
>> +
>> + /* Record the first 2 entry directly to MTPT table */
>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>> + goto found;
>> + i++;
>> + }
>> }
>>
>> +found:
>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>> V2_MPT_BYTE_56_PA0_H_S,
>> --
>> 1.9.1
>>
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-10-18 8:42 ` Wei Hu (Xavier)
@ 2017-10-18 9:12 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-10-18 9:12 UTC (permalink / raw)
To: Leon Romanovsky
Cc: lijun_nudt-9Onoh4P/yGk, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
shaobohsu-9Onoh4P/yGk, linuxarm-hv44wF8Li93QT0dZR+AlfA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
dledford-H+wXaHxf7aLQT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, shaoboxu-WVlzvzqoTvw,
shaobo.xu-ral2JQCrhuEAvxtiuMwx3w
On 2017/10/18 16:42, Wei Hu (Xavier) wrote:
>
>
> On 2017/10/1 0:10, Leon Romanovsky wrote:
>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>> If the IOMMU is enabled, the length of the sg obtained from
>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>> dma address, the IOVA will not be page contiguous, and the VA
>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>> the VA obtained by page_address is a discontiguous VA. Under
>>> these circumstances, the IOVA should be calculated based on the
>>> sg length, and the VA returned from dma_alloc_coherent should be
>>> recorded in the hem struct.
>>>
>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>> Signed-off-by: Shaobo Xu <xushaobo2-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>> Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>> ---
>> Doug,
>>
>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>
>> Thanks
>>
> Hi, Doug
> When running on the ARM64 platform, there is currently likely to be
> a call trace.
> Our colleague will report it to the iommu mailing list and try to
> solve it.
> I also think the RoCE driver shouldn't be aware of the difference.
> I will pull this patch out of the series and send v2.
> Thanks.
>
Hi, Doug & Leon
I have sent patch v2.
Thanks
Regards
Wei Hu
> Regards
> Wei Hu
>
>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>> +++++++++++++++++++++++++++---
>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22
>>> +++++++++++++++-------
>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> index 3e4c525..a69cd4b 100644
>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>> *hr_dev, u32 size, u32 max_direct,
>>> goto err_free;
>>>
>>> for (i = 0; i < buf->nbufs; ++i)
>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>> + pages[i] =
>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>> + virt_to_page(buf->page_list[i].buf);
>>>
>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>> PAGE_KERNEL);
>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>> index 8388ae2..4a3d1d4 100644
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
@ 2017-10-18 9:12 ` Wei Hu (Xavier)
0 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-10-18 9:12 UTC (permalink / raw)
To: Leon Romanovsky
Cc: lijun_nudt, linux-rdma, shaobohsu, linuxarm, linux-kernel,
dledford, zhangxiping3, shaoboxu, shaobo.xu, Doug Ledford,
Liuyixian (Eason), Chenxin (Charles)
On 2017/10/18 16:42, Wei Hu (Xavier) wrote:
>
>
> On 2017/10/1 0:10, Leon Romanovsky wrote:
>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>> If the IOMMU is enabled, the length of each sg entry obtained from
>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>> dma address, the IOVA will not be page-contiguous, and the VA
>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>> the VA obtained by page_address is a discontiguous VA. Under
>>> these circumstances, the IOVA should be calculated based on the
>>> sg length, and the VA returned from dma_alloc_coherent should be
>>> recorded in the hem struct.
>>>
>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>>> ---
>> Doug,
>>
>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>
>> Thanks
>>
> Hi, Doug
> When running on the ARM64 platform, a calltrace can currently
> occur. Our colleague will report it to the iommu mailing list and
> try to resolve it.
> I also think the RoCE driver shouldn't be aware of the difference.
> I will pull this patch out of the series and send v2.
> Thanks.
>
Hi, Doug & Leon
I have sent patch v2.
Thanks
Regards
Wei Hu
> Regards
> Wei Hu
>
>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>> +++++++++++++++++++++++++++---
>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22
>>> +++++++++++++++-------
>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> index 3e4c525..a69cd4b 100644
>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>> *hr_dev, u32 size, u32 max_direct,
>>> goto err_free;
>>>
>>> for (i = 0; i < buf->nbufs; ++i)
>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>> + pages[i] =
>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>> + virt_to_page(buf->page_list[i].buf);
>>>
>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>> PAGE_KERNEL);
>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>> index 8388ae2..4a3d1d4 100644
>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>> gfp_t gfp_mask)
>>> {
>>> struct hns_roce_hem_chunk *chunk = NULL;
>>> + struct hns_roce_vmalloc *vmalloc;
>>> struct hns_roce_hem *hem;
>>> struct scatterlist *mem;
>>> int order;
>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>> chunk->npages = 0;
>>> chunk->nsg = 0;
>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>> }
>>>
>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>> if (!buf)
>>> goto fail;
>>>
>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>> + if (is_vmalloc_addr(buf)) {
>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>> + vmalloc->is_vmalloc_addr = true;
>>> + vmalloc->vmalloc_addr = buf;
>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>> + PAGE_SIZE << order, offset_in_page(buf));
>>> + } else {
>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>> + }
>>> WARN_ON(mem->offset);
>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>
>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>> hns_roce_hem *hem)
>>> {
>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>> + void *cpu_addr;
>>> int i;
>>>
>>> if (!hem)
>>> return;
>>>
>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>> - for (i = 0; i < chunk->npages; ++i)
>>> + for (i = 0; i < chunk->npages; ++i) {
>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>> + else
>>> + cpu_addr =
>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>> +
>>> dma_free_coherent(hr_dev->dev,
>>> chunk->mem[i].length,
>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>> + cpu_addr,
>>> sg_dma_address(&chunk->mem[i]));
>>> + }
>>> kfree(chunk);
>>> }
>>>
>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>> *hr_dev,
>>>
>>> if (chunk->mem[i].length > (u32)offset) {
>>> page = sg_page(&chunk->mem[i]);
>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>> + mutex_unlock(&table->mutex);
>>> + return page ?
>>> + chunk->vmalloc[i].vmalloc_addr
>>> + + offset : NULL;
>>> + }
>>> goto out;
>>> }
>>> offset -= chunk->mem[i].length;
>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>> index af28bbf..62d712a 100644
>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>> @@ -72,11 +72,17 @@ enum {
>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>> };
>>>
>>> +struct hns_roce_vmalloc {
>>> + bool is_vmalloc_addr;
>>> + void *vmalloc_addr;
>>> +};
>>> +
>>> struct hns_roce_hem_chunk {
>>> struct list_head list;
>>> int npages;
>>> int nsg;
>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>> };
>>>
>>> struct hns_roce_hem {
>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> index b99d70a..9e19bf1 100644
>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>> *mb_buf, struct hns_roce_mr *mr,
>>> {
>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>> struct scatterlist *sg;
>>> + u64 page_addr = 0;
>>> u64 *pages;
>>> + int i = 0, j = 0;
>>> + int len = 0;
>>> int entry;
>>> - int i;
>>>
>>> mpt_entry = mb_buf;
>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>> *mb_buf, struct hns_roce_mr *mr,
>>>
>>> i = 0;
>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>> -
>>> - /* Record the first 2 entry directly to MTPT table */
>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>> - break;
>>> - i++;
>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>> + for (j = 0; j < len; ++j) {
>>> + page_addr = sg_dma_address(sg) +
>>> + (j << mr->umem->page_shift);
>>> + pages[i] = page_addr >> 6;
>>> +
>>> + /* Record the first 2 entry directly to MTPT table */
>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>> + goto found;
>>> + i++;
>>> + }
>>> }
>>>
>>> +found:
>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>> V2_MPT_BYTE_56_PA0_H_S,
>>> --
>>> 1.9.1
>>>
>
>
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
@ 2017-10-18 14:23 ` Leon Romanovsky
0 siblings, 0 replies; 57+ messages in thread
From: Leon Romanovsky @ 2017-10-18 14:23 UTC (permalink / raw)
To: Wei Hu (Xavier)
Cc: lijun_nudt, linux-rdma, shaobohsu, linuxarm, linux-kernel,
dledford, zhangxiping3, shaoboxu, shaobo.xu, Liuyixian (Eason),
Chenxin (Charles)
On Wed, Oct 18, 2017 at 05:12:02PM +0800, Wei Hu (Xavier) wrote:
>
>
> On 2017/10/18 16:42, Wei Hu (Xavier) wrote:
> >
> >
> > On 2017/10/1 0:10, Leon Romanovsky wrote:
> > > On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
> > > > If the IOMMU is enabled, the length of each sg entry obtained from
> > > > __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
> > > > dma address, the IOVA will not be page-contiguous, and the VA
> > > > returned from dma_alloc_coherent is a vmalloc address. However,
> > > > the VA obtained by page_address is a discontiguous VA. Under
> > > > these circumstances, the IOVA should be calculated based on the
> > > > sg length, and the VA returned from dma_alloc_coherent should be
> > > > recorded in the hem struct.
> > > >
> > > > Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
> > > > Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
> > > > Signed-off-by: Lijun Ou <oulijun@huawei.com>
> > > > ---
> > > Doug,
> > >
> > > I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
> > > driver code to deal with dma_alloc_coherent is most probably wrong.
> > >
> > > Thanks
> > >
> > Hi, Doug
> > When running on the ARM64 platform, a calltrace can currently
> > occur. Our colleague will report it to the iommu mailing list and
> > try to resolve it.
> > I also think the RoCE driver shouldn't be aware of the difference.
> > I will pull this patch out of the series and send v2.
> > Thanks.
> >
> Hi, Doug & Leon
> I have sent patch v2.
> Thanks
>
Thanks
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-10-12 12:59 ` Robin Murphy
@ 2017-11-01 7:46 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-11-01 7:46 UTC (permalink / raw)
To: Robin Murphy
Cc: Leon Romanovsky, shaobo.xu, xavier.huwei, lijun_nudt, oulijun,
linux-rdma, charles.chenxin, linuxarm, iommu, linux-kernel,
linux-mm, dledford, liuyixian, zhangxiping3, shaoboxu
On 2017/10/12 20:59, Robin Murphy wrote:
> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>
>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>> If the IOMMU is enabled, the length of each sg entry obtained from
>>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>>> dma address, the IOVA will not be page-contiguous, and the VA
>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>> the VA obtained by page_address is a discontiguous VA. Under
>>>> these circumstances, the IOVA should be calculated based on the
>>>> sg length, and the VA returned from dma_alloc_coherent should be
>>>> recorded in the hem struct.
>>>>
>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>>>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>>>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>>>> ---
>>> Doug,
>>>
>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>
>>> Thanks
>> Hi, Leon & Doug
>> We refered the function named __ttm_dma_alloc_page in the kernel
>> code as below:
>> And there are similar methods in bch_bio_map and mem_to_page
>> functions in current 4.14-rcx.
>>
>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>> {
>> 	struct dma_page *d_page;
>>
>> 	d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>> 	if (!d_page)
>> 		return NULL;
>>
>> 	d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>> 					   &d_page->dma,
>> 					   pool->gfp_flags);
>> 	if (d_page->vaddr) {
>> 		if (is_vmalloc_addr(d_page->vaddr))
>> 			d_page->p = vmalloc_to_page(d_page->vaddr);
>> 		else
>> 			d_page->p = virt_to_page(d_page->vaddr);
> There are cases on various architectures where neither of those is
> right. Whether those actually intersect with TTM or RDMA use-cases is
> another matter, of course.
>
> What definitely is a problem is if you ever take that page and end up
> accessing it through any virtual address other than the one explicitly
> returned by dma_alloc_coherent(). That can blow the coherency wide open
> and invite data loss, right up to killing the whole system with a
> machine check on certain architectures.
>
> Robin.
Hi, Robin
Thanks for your comment.
We have one problem, and the related code is below.
1. Call dma_alloc_coherent several times to allocate memory.
2. vmap the allocated memory pages.
3. Software accesses the memory through the virtual address returned
by vmap, while the hardware uses the dma address returned by
dma_alloc_coherent.
When the IOMMU is disabled on the ARM64 architecture, using
virt_to_page() before vmap() works. When the IOMMU is enabled,
virt_to_page() later causes a calltrace; we found that the address
returned by dma_alloc_coherent is a vmalloc address, so we added the
conditional statement below, and it works.
	for (i = 0; i < buf->nbufs; ++i)
		pages[i] = is_vmalloc_addr(buf->page_list[i].buf) ?
			   vmalloc_to_page(buf->page_list[i].buf) :
			   virt_to_page(buf->page_list[i].buf);
Can you give us a suggestion, or a better method?
The related code is below:
	buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
				 GFP_KERNEL);
	if (!buf->page_list)
		return -ENOMEM;

	for (i = 0; i < buf->nbufs; ++i) {
		buf->page_list[i].buf = dma_alloc_coherent(dev,
							   page_size, &t,
							   GFP_KERNEL);
		if (!buf->page_list[i].buf)
			goto err_free;

		buf->page_list[i].map = t;
		memset(buf->page_list[i].buf, 0, page_size);
	}

	pages = kmalloc_array(buf->nbufs, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		goto err_free;

	for (i = 0; i < buf->nbufs; ++i)
		pages[i] = is_vmalloc_addr(buf->page_list[i].buf) ?
			   vmalloc_to_page(buf->page_list[i].buf) :
			   virt_to_page(buf->page_list[i].buf);

	buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL);
	kfree(pages);
	if (!buf->direct.buf)
		goto err_free;
Regards
Wei Hu
>> 	} else {
>> 		kfree(d_page);
>> 		d_page = NULL;
>> 	}
>> 	return d_page;
>> }
>>
>> Regards
>> Wei Hu
>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>>> +++++++++++++++++++++++++++---
>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>> index 3e4c525..a69cd4b 100644
>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>>> *hr_dev, u32 size, u32 max_direct,
>>>> goto err_free;
>>>>
>>>> for (i = 0; i < buf->nbufs; ++i)
>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>>> + pages[i] =
>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>>> + virt_to_page(buf->page_list[i].buf);
>>>>
>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>> PAGE_KERNEL);
>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>> index 8388ae2..4a3d1d4 100644
>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>> gfp_t gfp_mask)
>>>> {
>>>> struct hns_roce_hem_chunk *chunk = NULL;
>>>> + struct hns_roce_vmalloc *vmalloc;
>>>> struct hns_roce_hem *hem;
>>>> struct scatterlist *mem;
>>>> int order;
>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>> chunk->npages = 0;
>>>> chunk->nsg = 0;
>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>>> }
>>>>
>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>> if (!buf)
>>>> goto fail;
>>>>
>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>> + if (is_vmalloc_addr(buf)) {
>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>>> + vmalloc->is_vmalloc_addr = true;
>>>> + vmalloc->vmalloc_addr = buf;
>>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>>> + PAGE_SIZE << order, offset_in_page(buf));
>>>> + } else {
>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>> + }
>>>> WARN_ON(mem->offset);
>>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>>
>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>>> hns_roce_hem *hem)
>>>> {
>>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>>> + void *cpu_addr;
>>>> int i;
>>>>
>>>> if (!hem)
>>>> return;
>>>>
>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>> - for (i = 0; i < chunk->npages; ++i)
>>>> + for (i = 0; i < chunk->npages; ++i) {
>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>> + else
>>>> + cpu_addr =
>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>>> +
>>>> dma_free_coherent(hr_dev->dev,
>>>> chunk->mem[i].length,
>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>> + cpu_addr,
>>>> sg_dma_address(&chunk->mem[i]));
>>>> + }
>>>> kfree(chunk);
>>>> }
>>>>
>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>> *hr_dev,
>>>>
>>>> if (chunk->mem[i].length > (u32)offset) {
>>>> page = sg_page(&chunk->mem[i]);
>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>> + mutex_unlock(&table->mutex);
>>>> + return page ?
>>>> + chunk->vmalloc[i].vmalloc_addr
>>>> + + offset : NULL;
>>>> + }
>>>> goto out;
>>>> }
>>>> offset -= chunk->mem[i].length;
>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>> index af28bbf..62d712a 100644
>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>> @@ -72,11 +72,17 @@ enum {
>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>> };
>>>>
>>>> +struct hns_roce_vmalloc {
>>>> + bool is_vmalloc_addr;
>>>> + void *vmalloc_addr;
>>>> +};
>>>> +
>>>> struct hns_roce_hem_chunk {
>>>> struct list_head list;
>>>> int npages;
>>>> int nsg;
>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>> };
>>>>
>>>> struct hns_roce_hem {
>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>> index b99d70a..9e19bf1 100644
>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>> *mb_buf, struct hns_roce_mr *mr,
>>>> {
>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>> struct scatterlist *sg;
>>>> + u64 page_addr = 0;
>>>> u64 *pages;
>>>> + int i = 0, j = 0;
>>>> + int len = 0;
>>>> int entry;
>>>> - int i;
>>>>
>>>> mpt_entry = mb_buf;
>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>
>>>> i = 0;
>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>> -
>>>> - /* Record the first 2 entry directly to MTPT table */
>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>> - break;
>>>> - i++;
>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>> + for (j = 0; j < len; ++j) {
>>>> + page_addr = sg_dma_address(sg) +
>>>> + (j << mr->umem->page_shift);
>>>> + pages[i] = page_addr >> 6;
>>>> +
>>>> + /* Record the first 2 entry directly to MTPT table */
>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>> + goto found;
>>>> + i++;
>>>> + }
>>>> }
>>>>
>>>> +found:
>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>> --
>>>> 1.9.1
>>>>
>>
>> _______________________________________________
>> iommu mailing list
>> iommu@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
> .
>
>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>> + cpu_addr,
>>>> sg_dma_address(&chunk->mem[i]));
>>>> + }
>>>> kfree(chunk);
>>>> }
>>>>
>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>> *hr_dev,
>>>>
>>>> if (chunk->mem[i].length > (u32)offset) {
>>>> page = sg_page(&chunk->mem[i]);
>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>> + mutex_unlock(&table->mutex);
>>>> + return page ?
>>>> + chunk->vmalloc[i].vmalloc_addr
>>>> + + offset : NULL;
>>>> + }
>>>> goto out;
>>>> }
>>>> offset -= chunk->mem[i].length;
>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>> index af28bbf..62d712a 100644
>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>> @@ -72,11 +72,17 @@ enum {
>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>> };
>>>>
>>>> +struct hns_roce_vmalloc {
>>>> + bool is_vmalloc_addr;
>>>> + void *vmalloc_addr;
>>>> +};
>>>> +
>>>> struct hns_roce_hem_chunk {
>>>> struct list_head list;
>>>> int npages;
>>>> int nsg;
>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>> };
>>>>
>>>> struct hns_roce_hem {
>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>> index b99d70a..9e19bf1 100644
>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>> *mb_buf, struct hns_roce_mr *mr,
>>>> {
>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>> struct scatterlist *sg;
>>>> + u64 page_addr = 0;
>>>> u64 *pages;
>>>> + int i = 0, j = 0;
>>>> + int len = 0;
>>>> int entry;
>>>> - int i;
>>>>
>>>> mpt_entry = mb_buf;
>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>
>>>> i = 0;
>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>> -
>>>> - /* Record the first 2 entry directly to MTPT table */
>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>> - break;
>>>> - i++;
>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>> + for (j = 0; j < len; ++j) {
>>>> + page_addr = sg_dma_address(sg) +
>>>> + (j << mr->umem->page_shift);
>>>> + pages[i] = page_addr >> 6;
>>>> +
>>>> + /* Record the first 2 entry directly to MTPT table */
>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>> + goto found;
>>>> + i++;
>>>> + }
>>>> }
>>>>
>>>> +found:
>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>> --
>>>> 1.9.1
>>>>
>>
>> _______________________________________________
>> iommu mailing list
>> iommu@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
> .
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-11-01 7:46 ` Wei Hu (Xavier)
(?)
@ 2017-11-01 12:26 ` Robin Murphy
-1 siblings, 0 replies; 57+ messages in thread
From: Robin Murphy @ 2017-11-01 12:26 UTC (permalink / raw)
To: Wei Hu (Xavier)
Cc: shaobo.xu-ral2JQCrhuEAvxtiuMwx3w, xavier.huwei-WVlzvzqoTvw,
lijun_nudt-9Onoh4P/yGk, oulijun-hv44wF8Li93QT0dZR+AlfA,
Leon Romanovsky, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, shaoboxu-WVlzvzqoTvw,
dledford-H+wXaHxf7aLQT0dZR+AlfA
On 01/11/17 07:46, Wei Hu (Xavier) wrote:
>
>
> On 2017/10/12 20:59, Robin Murphy wrote:
>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>>
>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>>> If the IOMMU is enabled, the length of each sg obtained from
>>>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>>>> dma address, the IOVA will not be page-contiguous, and the VA
>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>>> the VA obtained by page_address is a discontiguous VA. Under
>>>>> these circumstances, the IOVA should be calculated based on the
>>>>> sg length, and the VA returned from dma_alloc_coherent should be
>>>>> recorded in the hem struct.
>>>>>
>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>>>> Signed-off-by: Shaobo Xu <xushaobo2-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>>>> Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>>>> ---
>>>> Doug,
>>>>
>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>>
>>>> Thanks
>>> Hi, Leon & Doug
>>> We referred to the function __ttm_dma_alloc_page in the kernel
>>> code, as below. And there are similar methods in the bch_bio_map
>>> and mem_to_page functions in the current 4.14-rcX.
>>>
>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>>> {
>>> struct dma_page *d_page;
>>>
>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>>> if (!d_page)
>>> return NULL;
>>>
>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>>> &d_page->dma,
>>> pool->gfp_flags);
>>> if (d_page->vaddr) {
>>> if (is_vmalloc_addr(d_page->vaddr))
>>> d_page->p = vmalloc_to_page(d_page->vaddr);
>>> else
>>> d_page->p = virt_to_page(d_page->vaddr);
>> There are cases on various architectures where neither of those is
>> right. Whether those actually intersect with TTM or RDMA use-cases is
>> another matter, of course.
>>
>> What definitely is a problem is if you ever take that page and end up
>> accessing it through any virtual address other than the one explicitly
>> returned by dma_alloc_coherent(). That can blow the coherency wide open
>> and invite data loss, right up to killing the whole system with a
>> machine check on certain architectures.
>>
>> Robin.
> Hi, Robin
> Thanks for your comment.
>
> We have one problem, and the related code is below.
> 1. Call dma_alloc_coherent several times to allocate memory.
> 2. vmap the allocated memory pages.
> 3. Software accesses the memory through the virtual address
> returned by vmap, while hardware uses the DMA address from
> dma_alloc_coherent.
The simple answer is "don't do that". Seriously. dma_alloc_coherent()
gives you a CPU virtual address and a DMA address with which to access
your buffer, and that is the limit of what you may infer about it. You
have no guarantee that the virtual address is either in the linear map
or vmalloc, and not some other special place. You have no guarantee that
the underlying memory even has an associated struct page at all.
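The contract Robin describes can be sketched in a small userspace analogue (this is purely illustrative: malloc stands in for dma_alloc_coherent, uintptr_t for dma_addr_t, and all names here are hypothetical, not from the driver). The point is that the caller records exactly the two values the allocator returns and reaches the buffer through nothing else — no virt_to_page(), no second mapping.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for one coherent allocation: keep only the two
 * handles the allocator hands back (CPU vaddr + device address), and
 * never derive a struct page or an alternate mapping from them.
 */
struct coherent_buf {
	void *vaddr;        /* CPU view, as returned by the allocator */
	uintptr_t dma_addr; /* device view (simulated here) */
	size_t size;
};

static int coherent_buf_alloc(struct coherent_buf *buf, size_t size)
{
	buf->vaddr = malloc(size); /* stand-in for dma_alloc_coherent() */
	if (!buf->vaddr)
		return -1;
	buf->dma_addr = (uintptr_t)buf->vaddr; /* simulated IOVA */
	buf->size = size;
	memset(buf->vaddr, 0, size); /* dma_alloc_coherent() zeroes too */
	return 0;
}

static void coherent_buf_free(struct coherent_buf *buf)
{
	free(buf->vaddr);
	buf->vaddr = NULL;
}
```

All software accesses then go through buf->vaddr, and only buf->dma_addr is ever handed to the device.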
> When the IOMMU is disabled on the ARM64 architecture, using
> virt_to_page() before vmap() works. When the IOMMU is enabled,
> virt_to_page() causes a calltrace later; we found that the address
> returned by dma_alloc_coherent is a vmalloc address, so we added
> the conditional statement below, and it works.
> for (i = 0; i < buf->nbufs; ++i)
> pages[i] =
> is_vmalloc_addr(buf->page_list[i].buf) ?
> vmalloc_to_page(buf->page_list[i].buf) :
> virt_to_page(buf->page_list[i].buf);
> Can you suggest a better method?
Oh my goodness, having now taken a closer look at this driver, I'm lost
for words in disbelief. To pick just one example:
u32 bits_per_long = BITS_PER_LONG;
...
if (bits_per_long == 64) {
/* memory mapping nonsense */
}
WTF does the size of a long have to do with DMA buffer management!?
Of course I can guess that it might be trying to make some tortuous
inference about vmalloc space being constrained on 32-bit platforms, but
still...
>
> The related code as below:
> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
> GFP_KERNEL);
> if (!buf->page_list)
> return -ENOMEM;
>
> for (i = 0; i < buf->nbufs; ++i) {
> buf->page_list[i].buf = dma_alloc_coherent(dev,
> page_size, &t,
> GFP_KERNEL);
> if (!buf->page_list[i].buf)
> goto err_free;
>
> buf->page_list[i].map = t;
> memset(buf->page_list[i].buf, 0, page_size);
> }
>
> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
> GFP_KERNEL);
> if (!pages)
> goto err_free;
>
> for (i = 0; i < buf->nbufs; ++i)
> pages[i] =
> is_vmalloc_addr(buf->page_list[i].buf) ?
> vmalloc_to_page(buf->page_list[i].buf) :
> virt_to_page(buf->page_list[i].buf);
>
> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
> PAGE_KERNEL);
> kfree(pages);
> if (!buf->direct.buf)
> goto err_free;
OK, this is complete crap. As above, you cannot assume that a struct
page even exists; even if it does you cannot assume that using a
PAGE_KERNEL mapping will not result in mismatched attributes,
unpredictable behaviour and data loss. Trying to remap coherent DMA
allocations like this is just egregiously wrong.
What I do like is that you can seemingly fix all this by simply deleting
hns_roce_buf::direct and all the garbage code related to it, and using
the page_list entries consistently because the alternate paths involving
those appear to do the right thing already.
That is, of course, assuming that the buffers involved can be so large
that it's not practical to just always make a single allocation and
fragment it into multiple descriptors if the hardware does have some
maximum length constraint - frankly I'm a little puzzled by the
PAGE_SIZE * 2 threshold, given that that's not a fixed size.
Robin.
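The direction Robin suggests — delete hns_roce_buf::direct and address the buffer through the page_list entries — boils down to a small index computation. A hypothetical userspace sketch (the struct and function names are mine, and it assumes the per-fragment page_size is a power of two, as in the driver):

```c
#include <stddef.h>

/* Hypothetical fragment-addressing helper: given the logical byte
 * offset of a descriptor, find the page_list fragment holding it and
 * the offset within that fragment, instead of relying on a single
 * vmap()ed contiguous view.
 */
struct frag_buf {
	void **bufs;      /* one allocation vaddr per fragment */
	size_t nbufs;     /* number of fragments */
	size_t page_size; /* bytes per fragment, power of two */
};

static void *frag_buf_at(const struct frag_buf *buf, size_t offset)
{
	size_t idx = offset / buf->page_size;       /* which fragment */
	size_t off = offset & (buf->page_size - 1); /* offset inside it */

	if (idx >= buf->nbufs)
		return NULL;
	return (char *)buf->bufs[idx] + off;
}
```

Each fragment is still accessed only through the virtual address its own allocation returned, so no remapping of coherent memory is needed.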
>
> Regards
> Wei Hu
>>> } else {
>>> kfree(d_page);
>>> d_page = NULL;
>>> }
>>> return d_page;
>>> }
>>>
>>> Regards
>>> Wei Hu
>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>>>> +++++++++++++++++++++++++++---
>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>> index 3e4c525..a69cd4b 100644
>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>>>> *hr_dev, u32 size, u32 max_direct,
>>>>> goto err_free;
>>>>>
>>>>> for (i = 0; i < buf->nbufs; ++i)
>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>>>> + pages[i] =
>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>>>> + virt_to_page(buf->page_list[i].buf);
>>>>>
>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>>> PAGE_KERNEL);
>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>> index 8388ae2..4a3d1d4 100644
>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>> gfp_t gfp_mask)
>>>>> {
>>>>> struct hns_roce_hem_chunk *chunk = NULL;
>>>>> + struct hns_roce_vmalloc *vmalloc;
>>>>> struct hns_roce_hem *hem;
>>>>> struct scatterlist *mem;
>>>>> int order;
>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>>> chunk->npages = 0;
>>>>> chunk->nsg = 0;
>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>>>> }
>>>>>
>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>> if (!buf)
>>>>> goto fail;
>>>>>
>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>> + if (is_vmalloc_addr(buf)) {
>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>>>> + vmalloc->is_vmalloc_addr = true;
>>>>> + vmalloc->vmalloc_addr = buf;
>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>>>> + PAGE_SIZE << order, offset_in_page(buf));
>>>>> + } else {
>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>> + }
>>>>> WARN_ON(mem->offset);
>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>>>
>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>>>> hns_roce_hem *hem)
>>>>> {
>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>>>> + void *cpu_addr;
>>>>> int i;
>>>>>
>>>>> if (!hem)
>>>>> return;
>>>>>
>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>>> - for (i = 0; i < chunk->npages; ++i)
>>>>> + for (i = 0; i < chunk->npages; ++i) {
>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>>> + else
>>>>> + cpu_addr =
>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>>>> +
>>>>> dma_free_coherent(hr_dev->dev,
>>>>> chunk->mem[i].length,
>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>>> + cpu_addr,
>>>>> sg_dma_address(&chunk->mem[i]));
>>>>> + }
>>>>> kfree(chunk);
>>>>> }
>>>>>
>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>>> *hr_dev,
>>>>>
>>>>> if (chunk->mem[i].length > (u32)offset) {
>>>>> page = sg_page(&chunk->mem[i]);
>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>>> + mutex_unlock(&table->mutex);
>>>>> + return page ?
>>>>> + chunk->vmalloc[i].vmalloc_addr
>>>>> + + offset : NULL;
>>>>> + }
>>>>> goto out;
>>>>> }
>>>>> offset -= chunk->mem[i].length;
>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>> index af28bbf..62d712a 100644
>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>> @@ -72,11 +72,17 @@ enum {
>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>>> };
>>>>>
>>>>> +struct hns_roce_vmalloc {
>>>>> + bool is_vmalloc_addr;
>>>>> + void *vmalloc_addr;
>>>>> +};
>>>>> +
>>>>> struct hns_roce_hem_chunk {
>>>>> struct list_head list;
>>>>> int npages;
>>>>> int nsg;
>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>>> };
>>>>>
>>>>> struct hns_roce_hem {
>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>> index b99d70a..9e19bf1 100644
>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>> {
>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>>> struct scatterlist *sg;
>>>>> + u64 page_addr = 0;
>>>>> u64 *pages;
>>>>> + int i = 0, j = 0;
>>>>> + int len = 0;
>>>>> int entry;
>>>>> - int i;
>>>>>
>>>>> mpt_entry = mb_buf;
>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>
>>>>> i = 0;
>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>>> -
>>>>> - /* Record the first 2 entry directly to MTPT table */
>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>> - break;
>>>>> - i++;
>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>>> + for (j = 0; j < len; ++j) {
>>>>> + page_addr = sg_dma_address(sg) +
>>>>> + (j << mr->umem->page_shift);
>>>>> + pages[i] = page_addr >> 6;
>>>>> +
>>>>> + /* Record the first 2 entry directly to MTPT table */
>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>> + goto found;
>>>>> + i++;
>>>>> + }
>>>>> }
>>>>>
>>>>> +found:
>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>>> --
>>>>> 1.9.1
>>>>>
>>>
>>> _______________________________________________
>>> iommu mailing list
>>> iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>
>> .
>>
>
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
@ 2017-11-01 12:26 ` Robin Murphy
0 siblings, 0 replies; 57+ messages in thread
From: Robin Murphy @ 2017-11-01 12:26 UTC (permalink / raw)
To: Wei Hu (Xavier)
Cc: Leon Romanovsky, shaobo.xu, xavier.huwei, lijun_nudt, oulijun,
linux-rdma, charles.chenxin, linuxarm, iommu, linux-kernel,
linux-mm, dledford, liuyixian, zhangxiping3, shaoboxu
On 01/11/17 07:46, Wei Hu (Xavier) wrote:
>
>
> On 2017/10/12 20:59, Robin Murphy wrote:
>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>>
>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>>> If the IOMMU is enabled, the length of the sg obtained from
>>>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>>>> dma address, the IOVA will not be page-contiguous, and the VA
>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>>> the VA obtained by page_address is a discontiguous VA. Under
>>>>> these circumstances, the IOVA should be calculated based on the
>>>>> sg length, and the VA returned from dma_alloc_coherent should be
>>>>> recorded in the hem struct.
>>>>>
>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>>>>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>>>>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>>>>> ---
>>>> Doug,
>>>>
>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>>
>>>> Thanks
>>> Hi, Leon & Doug
>>> We referred to the function named __ttm_dma_alloc_page in the
>>> kernel code, shown below. There are similar approaches in the
>>> bch_bio_map and mem_to_page functions in current 4.14-rcX.
>>>
>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>>> {
>>> struct dma_page *d_page;
>>>
>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>>> if (!d_page)
>>> return NULL;
>>>
>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>>> &d_page->dma,
>>> pool->gfp_flags);
>>> if (d_page->vaddr) {
>>> if (is_vmalloc_addr(d_page->vaddr))
>>> d_page->p = vmalloc_to_page(d_page->vaddr);
>>> else
>>> d_page->p = virt_to_page(d_page->vaddr);
>> There are cases on various architectures where neither of those is
>> right. Whether those actually intersect with TTM or RDMA use-cases is
>> another matter, of course.
>>
>> What definitely is a problem is if you ever take that page and end up
>> accessing it through any virtual address other than the one explicitly
>> returned by dma_alloc_coherent(). That can blow the coherency wide open
>> and invite data loss, right up to killing the whole system with a
>> machine check on certain architectures.
>>
>> Robin.
> Hi, Robin
> Thanks for your comment.
>
> We have one problem, and the related code is below.
> 1. Call the dma_alloc_coherent function several times to allocate memory.
> 2. vmap the allocated memory pages.
> 3. Software accesses the memory using the virtual address returned by
> vmap, while hardware uses the dma addr from dma_alloc_coherent.
The simple answer is "don't do that". Seriously. dma_alloc_coherent()
gives you a CPU virtual address and a DMA address with which to access
your buffer, and that is the limit of what you may infer about it. You
have no guarantee that the virtual address is either in the linear map
or vmalloc, and not some other special place. You have no guarantee that
the underlying memory even has an associated struct page at all.
> When the IOMMU is disabled on the ARM64 architecture, we use
> virt_to_page() before vmap() and it works. When the IOMMU is
> enabled, using virt_to_page() causes a calltrace later; we found
> that the address returned by dma_alloc_coherent is a vmalloc
> address, so we added the conditional check below, and it works.
> for (i = 0; i < buf->nbufs; ++i)
> pages[i] =
> is_vmalloc_addr(buf->page_list[i].buf) ?
> vmalloc_to_page(buf->page_list[i].buf) :
> virt_to_page(buf->page_list[i].buf);
> Can you give us a suggestion or a better method?
Oh my goodness, having now taken a closer look at this driver, I'm lost
for words in disbelief. To pick just one example:
u32 bits_per_long = BITS_PER_LONG;
...
if (bits_per_long == 64) {
/* memory mapping nonsense */
}
WTF does the size of a long have to do with DMA buffer management!?
Of course I can guess that it might be trying to make some tortuous
inference about vmalloc space being constrained on 32-bit platforms, but
still...
>
> The related code as below:
> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
> GFP_KERNEL);
> if (!buf->page_list)
> return -ENOMEM;
>
> for (i = 0; i < buf->nbufs; ++i) {
> buf->page_list[i].buf = dma_alloc_coherent(dev,
> page_size, &t,
> GFP_KERNEL);
> if (!buf->page_list[i].buf)
> goto err_free;
>
> buf->page_list[i].map = t;
> memset(buf->page_list[i].buf, 0, page_size);
> }
>
> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
> GFP_KERNEL);
> if (!pages)
> goto err_free;
>
> for (i = 0; i < buf->nbufs; ++i)
> pages[i] =
> is_vmalloc_addr(buf->page_list[i].buf) ?
> vmalloc_to_page(buf->page_list[i].buf) :
> virt_to_page(buf->page_list[i].buf);
>
> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
> PAGE_KERNEL);
> kfree(pages);
> if (!buf->direct.buf)
> goto err_free;
OK, this is complete crap. As above, you cannot assume that a struct
page even exists; even if it does you cannot assume that using a
PAGE_KERNEL mapping will not result in mismatched attributes,
unpredictable behaviour and data loss. Trying to remap coherent DMA
allocations like this is just egregiously wrong.
What I do like is that you can seemingly fix all this by simply deleting
hns_roce_buf::direct and all the garbage code related to it, and using
the page_list entries consistently because the alternate paths involving
those appear to do the right thing already.
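Robin's suggestion of dropping hns_roce_buf::direct can be illustrated with a small, self-contained sketch: instead of reading through one contiguous vmap()'d mapping, a byte offset is resolved to an entry in the page list plus an offset within that page. The names (`buf_list`, `buf_offset`) and the hard-coded 4K page size are illustrative stand-ins, not the driver's actual types.

```c
#include <assert.h>

/* Illustrative stand-ins for the driver's types; the page size is
 * fixed to 4K here purely for the sketch. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE (1UL << SKETCH_PAGE_SHIFT)

struct buf_list { void *buf; };

/* Resolve a byte offset through the per-page list instead of a
 * contiguous "direct" mapping. */
static void *buf_offset(struct buf_list *page_list, unsigned long offset)
{
    return (char *)page_list[offset >> SKETCH_PAGE_SHIFT].buf +
           (offset & (SKETCH_PAGE_SIZE - 1));
}
```

With this, every consumer indexes the page list consistently and no second CPU mapping of the coherent buffers is ever created.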
That is, of course, assuming that the buffers involved can be so large
that it's not practical to just always make a single allocation and
fragment it into multiple descriptors if the hardware does have some
maximum length constraint - frankly I'm a little puzzled by the
PAGE_SIZE * 2 threshold, given that that's not a fixed size.
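The alternative Robin describes, a single allocation fragmented into multiple descriptors when the hardware has a maximum length constraint, can be sketched as below. This is a minimal userspace model; `struct desc` and `fill_descs` are hypothetical names, not part of the driver.

```c
#include <assert.h>
#include <stddef.h>

struct desc { unsigned long long addr; size_t len; };

/* Split one contiguous DMA region into descriptors of at most
 * max_len bytes each; returns the number of descriptors written. */
static size_t fill_descs(unsigned long long dma_addr, size_t size,
                         size_t max_len, struct desc *out,
                         size_t max_descs)
{
    size_t n = 0;

    while (size && n < max_descs) {
        size_t chunk = size < max_len ? size : max_len;

        out[n].addr = dma_addr;
        out[n].len = chunk;
        dma_addr += chunk;
        size -= chunk;
        n++;
    }
    return n;
}
```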
Robin.
>
> Regards
> Wei Hu
>>> } else {
>>> kfree(d_page);
>>> d_page = NULL;
>>> }
>>> return d_page;
>>> }
>>>
>>> Regards
>>> Wei Hu
>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>>>> +++++++++++++++++++++++++++---
>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>> index 3e4c525..a69cd4b 100644
>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>>>> *hr_dev, u32 size, u32 max_direct,
>>>>> goto err_free;
>>>>>
>>>>> for (i = 0; i < buf->nbufs; ++i)
>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>>>> + pages[i] =
>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>>>> + virt_to_page(buf->page_list[i].buf);
>>>>>
>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>>> PAGE_KERNEL);
>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>> index 8388ae2..4a3d1d4 100644
>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>> gfp_t gfp_mask)
>>>>> {
>>>>> struct hns_roce_hem_chunk *chunk = NULL;
>>>>> + struct hns_roce_vmalloc *vmalloc;
>>>>> struct hns_roce_hem *hem;
>>>>> struct scatterlist *mem;
>>>>> int order;
>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>>> chunk->npages = 0;
>>>>> chunk->nsg = 0;
>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>>>> }
>>>>>
>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>> if (!buf)
>>>>> goto fail;
>>>>>
>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>> + if (is_vmalloc_addr(buf)) {
>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>>>> + vmalloc->is_vmalloc_addr = true;
>>>>> + vmalloc->vmalloc_addr = buf;
>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>>>> + PAGE_SIZE << order, offset_in_page(buf));
>>>>> + } else {
>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>> + }
>>>>> WARN_ON(mem->offset);
>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>>>
>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>>>> hns_roce_hem *hem)
>>>>> {
>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>>>> + void *cpu_addr;
>>>>> int i;
>>>>>
>>>>> if (!hem)
>>>>> return;
>>>>>
>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>>> - for (i = 0; i < chunk->npages; ++i)
>>>>> + for (i = 0; i < chunk->npages; ++i) {
>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>>> + else
>>>>> + cpu_addr =
>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>>>> +
>>>>> dma_free_coherent(hr_dev->dev,
>>>>> chunk->mem[i].length,
>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>>> + cpu_addr,
>>>>> sg_dma_address(&chunk->mem[i]));
>>>>> + }
>>>>> kfree(chunk);
>>>>> }
>>>>>
>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>>> *hr_dev,
>>>>>
>>>>> if (chunk->mem[i].length > (u32)offset) {
>>>>> page = sg_page(&chunk->mem[i]);
>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>>> + mutex_unlock(&table->mutex);
>>>>> + return page ?
>>>>> + chunk->vmalloc[i].vmalloc_addr
>>>>> + + offset : NULL;
>>>>> + }
>>>>> goto out;
>>>>> }
>>>>> offset -= chunk->mem[i].length;
>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>> index af28bbf..62d712a 100644
>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>> @@ -72,11 +72,17 @@ enum {
>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>>> };
>>>>>
>>>>> +struct hns_roce_vmalloc {
>>>>> + bool is_vmalloc_addr;
>>>>> + void *vmalloc_addr;
>>>>> +};
>>>>> +
>>>>> struct hns_roce_hem_chunk {
>>>>> struct list_head list;
>>>>> int npages;
>>>>> int nsg;
>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>>> };
>>>>>
>>>>> struct hns_roce_hem {
>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>> index b99d70a..9e19bf1 100644
>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>> {
>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>>> struct scatterlist *sg;
>>>>> + u64 page_addr = 0;
>>>>> u64 *pages;
>>>>> + int i = 0, j = 0;
>>>>> + int len = 0;
>>>>> int entry;
>>>>> - int i;
>>>>>
>>>>> mpt_entry = mb_buf;
>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>
>>>>> i = 0;
>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>>> -
>>>>> - /* Record the first 2 entry directly to MTPT table */
>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>> - break;
>>>>> - i++;
>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>>> + for (j = 0; j < len; ++j) {
>>>>> + page_addr = sg_dma_address(sg) +
>>>>> + (j << mr->umem->page_shift);
>>>>> + pages[i] = page_addr >> 6;
>>>>> +
>>>>> + /* Record the first 2 entry directly to MTPT table */
>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>> + goto found;
>>>>> + i++;
>>>>> + }
>>>>> }
>>>>>
>>>>> +found:
>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>>> --
>>>>> 1.9.1
>>>>>
>>>
>>> _______________________________________________
>>> iommu mailing list
>>> iommu@lists.linux-foundation.org
>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>
>> .
>>
>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-11-01 12:26 ` Robin Murphy
(?)
@ 2017-11-07 2:45 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-11-07 2:45 UTC (permalink / raw)
To: Robin Murphy
Cc: Leon Romanovsky, shaobo.xu-ral2JQCrhuEAvxtiuMwx3w,
xavier.huwei-WVlzvzqoTvw, lijun_nudt-9Onoh4P/yGk,
oulijun-hv44wF8Li93QT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, dledford-H+wXaHxf7aLQT0dZR+AlfA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, shaoboxu-WVlzvzqoTvw
On 2017/11/1 20:26, Robin Murphy wrote:
> On 01/11/17 07:46, Wei Hu (Xavier) wrote:
>>
>> On 2017/10/12 20:59, Robin Murphy wrote:
>>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>>>> If the IOMMU is enabled, the length of the sg obtained from
>>>>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>>>>> dma address, the IOVA will not be page-contiguous, and the VA
>>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>>>> the VA obtained by page_address is a discontiguous VA. Under
>>>>>> these circumstances, the IOVA should be calculated based on the
>>>>>> sg length, and the VA returned from dma_alloc_coherent should be
>>>>>> recorded in the hem struct.
>>>>>>
>>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>>>>> Signed-off-by: Shaobo Xu <xushaobo2-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>>>>> Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>>>>> ---
>>>>> Doug,
>>>>>
>>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>>>
>>>>> Thanks
>>>> Hi, Leon & Doug
>>>> We referred to the function named __ttm_dma_alloc_page in the
>>>> kernel code, shown below. There are similar approaches in the
>>>> bch_bio_map and mem_to_page functions in current 4.14-rcX.
>>>>
>>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>>>> {
>>>> struct dma_page *d_page;
>>>>
>>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>>>> if (!d_page)
>>>> return NULL;
>>>>
>>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>>>> &d_page->dma,
>>>> pool->gfp_flags);
>>>> if (d_page->vaddr) {
>>>> if (is_vmalloc_addr(d_page->vaddr))
>>>> d_page->p = vmalloc_to_page(d_page->vaddr);
>>>> else
>>>> d_page->p = virt_to_page(d_page->vaddr);
>>> There are cases on various architectures where neither of those is
>>> right. Whether those actually intersect with TTM or RDMA use-cases is
>>> another matter, of course.
>>>
>>> What definitely is a problem is if you ever take that page and end up
>>> accessing it through any virtual address other than the one explicitly
>>> returned by dma_alloc_coherent(). That can blow the coherency wide open
>>> and invite data loss, right up to killing the whole system with a
>>> machine check on certain architectures.
>>>
>>> Robin.
>> Hi, Robin
>> Thanks for your comment.
>>
>> We have one problem, and the related code is below.
>> 1. Call the dma_alloc_coherent function several times to allocate memory.
>> 2. vmap the allocated memory pages.
>> 3. Software accesses the memory using the virtual address returned by
>> vmap, while hardware uses the dma addr from dma_alloc_coherent.
> The simple answer is "don't do that". Seriously. dma_alloc_coherent()
> gives you a CPU virtual address and a DMA address with which to access
> your buffer, and that is the limit of what you may infer about it. You
> have no guarantee that the virtual address is either in the linear map
> or vmalloc, and not some other special place. You have no guarantee that
> the underlying memory even has an associated struct page at all.
>
>> When the IOMMU is disabled on the ARM64 architecture, we use
>> virt_to_page() before vmap() and it works. When the IOMMU is
>> enabled, using virt_to_page() causes a calltrace later; we found
>> that the address returned by dma_alloc_coherent is a vmalloc
>> address, so we added the conditional check below, and it works.
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] =
>> is_vmalloc_addr(buf->page_list[i].buf) ?
>> vmalloc_to_page(buf->page_list[i].buf) :
>> virt_to_page(buf->page_list[i].buf);
>> Can you give us a suggestion or a better method?
> Oh my goodness, having now taken a closer look at this driver, I'm lost
> for words in disbelief. To pick just one example:
>
> u32 bits_per_long = BITS_PER_LONG;
> ...
> if (bits_per_long == 64) {
> /* memory mapping nonsense */
> }
>
> WTF does the size of a long have to do with DMA buffer management!?
>
> Of course I can guess that it might be trying to make some tortuous
> inference about vmalloc space being constrained on 32-bit platforms, but
> still...
>
>> The related code as below:
>> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>> GFP_KERNEL);
>> if (!buf->page_list)
>> return -ENOMEM;
>>
>> for (i = 0; i < buf->nbufs; ++i) {
>> buf->page_list[i].buf = dma_alloc_coherent(dev,
>> page_size, &t,
>> GFP_KERNEL);
>> if (!buf->page_list[i].buf)
>> goto err_free;
>>
>> buf->page_list[i].map = t;
>> memset(buf->page_list[i].buf, 0, page_size);
>> }
>>
>> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
>> GFP_KERNEL);
>> if (!pages)
>> goto err_free;
>>
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] =
>> is_vmalloc_addr(buf->page_list[i].buf) ?
>> vmalloc_to_page(buf->page_list[i].buf) :
>> virt_to_page(buf->page_list[i].buf);
>>
>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>> PAGE_KERNEL);
>> kfree(pages);
>> if (!buf->direct.buf)
>> goto err_free;
> OK, this is complete crap. As above, you cannot assume that a struct
> page even exists; even if it does you cannot assume that using a
> PAGE_KERNEL mapping will not result in mismatched attributes,
> unpredictable behaviour and data loss. Trying to remap coherent DMA
> allocations like this is just egregiously wrong.
>
> What I do like is that you can seemingly fix all this by simply deleting
> hns_roce_buf::direct and all the garbage code related to it, and using
> the page_list entries consistently because the alternate paths involving
> those appear to do the right thing already.
>
> That is, of course, assuming that the buffers involved can be so large
> that it's not practical to just always make a single allocation and
> fragment it into multiple descriptors if the hardware does have some
> maximum length constraint - frankly I'm a little puzzled by the
> PAGE_SIZE * 2 threshold, given that that's not a fixed size.
>
> Robin.
Hi, Robin
We reconstructed the code as below: it replaces dma_alloc_coherent
with the __get_free_pages and dma_map_single functions. So we can
vmap several pointers returned by __get_free_pages, right?
        buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
                                 GFP_KERNEL);
        if (!buf->page_list)
                return -ENOMEM;

        for (i = 0; i < buf->nbufs; ++i) {
                ptr = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
                                               get_order(page_size));
                if (!ptr) {
                        dev_err(dev, "Alloc pages error.\n");
                        goto err_free;
                }

                t = dma_map_single(dev, ptr, page_size,
                                   DMA_BIDIRECTIONAL);
                if (dma_mapping_error(dev, t)) {
                        dev_err(dev, "DMA mapping error.\n");
                        free_pages((unsigned long)ptr,
                                   get_order(page_size));
                        goto err_free;
                }

                buf->page_list[i].buf = ptr;
                buf->page_list[i].map = t;
        }

        pages = kmalloc_array(buf->nbufs, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                goto err_free;

        for (i = 0; i < buf->nbufs; ++i)
                pages[i] = virt_to_page(buf->page_list[i].buf);

        buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL);
        kfree(pages);
        if (!buf->direct.buf)
                goto err_free;
Regards
Wei Hu
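One caveat with this rework (an observation on the proposal, not something stated above): replacing dma_alloc_coherent() with __get_free_pages() + dma_map_single() switches from coherent to streaming DMA, so on non-coherent platforms CPU accesses to the buffer generally need dma_sync_single_for_cpu()/dma_sync_single_for_device() bracketing. A toy userspace model of that contract, with purely illustrative names:

```c
#include <assert.h>
#include <string.h>

/* Toy model: a buffer with a CPU (cache) view and a device-visible
 * view. CPU writes become visible to the "device" only after an
 * explicit sync, modeling dma_sync_single_for_device(). */
struct stream_buf {
    char cpu[16];     /* CPU (cache) view */
    char device[16];  /* device-visible view */
};

static void cpu_write(struct stream_buf *b, const char *s)
{
    strcpy(b->cpu, s);                        /* dirty in "cache" only */
}

static void sync_for_device(struct stream_buf *b)
{
    memcpy(b->device, b->cpu, sizeof b->cpu); /* writeback */
}
```

In the real API, whether a sync is actually required depends on the platform's cache coherency, and DMA_BIDIRECTIONAL mappings need syncs in both directions around device activity.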
>> Regards
>> Wei Hu
>>>> } else {
>>>> kfree(d_page);
>>>> d_page = NULL;
>>>> }
>>>> return d_page;
>>>> }
>>>>
>>>> Regards
>>>> Wei Hu
>>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>>>>> +++++++++++++++++++++++++++---
>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> index 3e4c525..a69cd4b 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>>>>> *hr_dev, u32 size, u32 max_direct,
>>>>>> goto err_free;
>>>>>>
>>>>>> for (i = 0; i < buf->nbufs; ++i)
>>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>>>>> + pages[i] =
>>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>>>>> + virt_to_page(buf->page_list[i].buf);
>>>>>>
>>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>>>> PAGE_KERNEL);
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> index 8388ae2..4a3d1d4 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> gfp_t gfp_mask)
>>>>>> {
>>>>>> struct hns_roce_hem_chunk *chunk = NULL;
>>>>>> + struct hns_roce_vmalloc *vmalloc;
>>>>>> struct hns_roce_hem *hem;
>>>>>> struct scatterlist *mem;
>>>>>> int order;
>>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>>>> chunk->npages = 0;
>>>>>> chunk->nsg = 0;
>>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>>>>> }
>>>>>>
>>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> if (!buf)
>>>>>> goto fail;
>>>>>>
>>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> + if (is_vmalloc_addr(buf)) {
>>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>>>>> + vmalloc->is_vmalloc_addr = true;
>>>>>> + vmalloc->vmalloc_addr = buf;
>>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>>>>> + PAGE_SIZE << order, offset_in_page(buf));
>>>>>> + } else {
>>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> + }
>>>>>> WARN_ON(mem->offset);
>>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>>>>
>>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>>>>> hns_roce_hem *hem)
>>>>>> {
>>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>>>>> + void *cpu_addr;
>>>>>> int i;
>>>>>>
>>>>>> if (!hem)
>>>>>> return;
>>>>>>
>>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>>>> - for (i = 0; i < chunk->npages; ++i)
>>>>>> + for (i = 0; i < chunk->npages; ++i) {
>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>>>> + else
>>>>>> + cpu_addr =
>>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>>>>> +
>>>>>> dma_free_coherent(hr_dev->dev,
>>>>>> chunk->mem[i].length,
>>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>>>> + cpu_addr,
>>>>>> sg_dma_address(&chunk->mem[i]));
>>>>>> + }
>>>>>> kfree(chunk);
>>>>>> }
>>>>>>
>>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>>>> *hr_dev,
>>>>>>
>>>>>> if (chunk->mem[i].length > (u32)offset) {
>>>>>> page = sg_page(&chunk->mem[i]);
>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>>>> + mutex_unlock(&table->mutex);
>>>>>> + return page ?
>>>>>> + chunk->vmalloc[i].vmalloc_addr
>>>>>> + + offset : NULL;
>>>>>> + }
>>>>>> goto out;
>>>>>> }
>>>>>> offset -= chunk->mem[i].length;
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> index af28bbf..62d712a 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> @@ -72,11 +72,17 @@ enum {
>>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>>>> };
>>>>>>
>>>>>> +struct hns_roce_vmalloc {
>>>>>> + bool is_vmalloc_addr;
>>>>>> + void *vmalloc_addr;
>>>>>> +};
>>>>>> +
>>>>>> struct hns_roce_hem_chunk {
>>>>>> struct list_head list;
>>>>>> int npages;
>>>>>> int nsg;
>>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>> };
>>>>>>
>>>>>> struct hns_roce_hem {
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> index b99d70a..9e19bf1 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>> {
>>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>>>> struct scatterlist *sg;
>>>>>> + u64 page_addr = 0;
>>>>>> u64 *pages;
>>>>>> + int i = 0, j = 0;
>>>>>> + int len = 0;
>>>>>> int entry;
>>>>>> - int i;
>>>>>>
>>>>>> mpt_entry = mb_buf;
>>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>>
>>>>>> i = 0;
>>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>>>> -
>>>>>> - /* Record the first 2 entry directly to MTPT table */
>>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> - break;
>>>>>> - i++;
>>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>>>> + for (j = 0; j < len; ++j) {
>>>>>> + page_addr = sg_dma_address(sg) +
>>>>>> + (j << mr->umem->page_shift);
>>>>>> + pages[i] = page_addr >> 6;
>>>>>> +
>>>>>> + /* Record the first 2 entry directly to MTPT table */
>>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> + goto found;
>>>>>> + i++;
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> +found:
>>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>> .
>>>
>>
>
> .
>
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
@ 2017-11-07 2:45 ` Wei Hu (Xavier)
0 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-11-07 2:45 UTC (permalink / raw)
To: Robin Murphy
Cc: Leon Romanovsky, shaobo.xu, xavier.huwei, lijun_nudt, oulijun,
linux-rdma, charles.chenxin, linuxarm, iommu, linux-kernel,
linux-mm, dledford, liuyixian, zhangxiping3, shaoboxu
On 2017/11/1 20:26, Robin Murphy wrote:
> On 01/11/17 07:46, Wei Hu (Xavier) wrote:
>>
>> On 2017/10/12 20:59, Robin Murphy wrote:
>>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>>>> If the IOMMU is enabled, the length of sg obtained from
>>>>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>>>>> dma address, the IOVA will not be page continuous. and the VA
>>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>>>> the VA obtained by the page_address is a discontinuous VA. Under
>>>>>> these circumstances, the IOVA should be calculated based on the
>>>>>> sg length, and record the VA returned from dma_alloc_coherent
>>>>>> in the struct of hem.
>>>>>>
>>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>>>>>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>>>>>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>>>>>> ---
>>>>> Doug,
>>>>>
>>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>>>
>>>>> Thanks
>>>> Hi, Leon & Doug
>>>> We refered the function named __ttm_dma_alloc_page in the kernel
>>>> code as below:
>>>> And there are similar methods in bch_bio_map and mem_to_page
>>>> functions in current 4.14-rcx.
>>>>
>>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>>>> {
>>>> struct dma_page *d_page;
>>>>
>>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>>>> if (!d_page)
>>>> return NULL;
>>>>
>>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>>>> &d_page->dma,
>>>> pool->gfp_flags);
>>>> if (d_page->vaddr) {
>>>> if (is_vmalloc_addr(d_page->vaddr))
>>>> d_page->p = vmalloc_to_page(d_page->vaddr);
>>>> else
>>>> d_page->p = virt_to_page(d_page->vaddr);
>>> There are cases on various architectures where neither of those is
>>> right. Whether those actually intersect with TTM or RDMA use-cases is
>>> another matter, of course.
>>>
>>> What definitely is a problem is if you ever take that page and end up
>>> accessing it through any virtual address other than the one explicitly
>>> returned by dma_alloc_coherent(). That can blow the coherency wide open
>>> and invite data loss, right up to killing the whole system with a
>>> machine check on certain architectures.
>>>
>>> Robin.
>> Hi, Robin
>> Thanks for your comment.
>>
>> We have one problem and the related code as below.
>> 1. call dma_alloc_coherent function serval times to alloc memory.
>> 2. vmap the allocated memory pages.
>> 3. software access memory by using the return virt addr of vmap
>> and hardware using the dma addr of dma_alloc_coherent.
> The simple answer is "don't do that". Seriously. dma_alloc_coherent()
> gives you a CPU virtual address and a DMA address with which to access
> your buffer, and that is the limit of what you may infer about it. You
> have no guarantee that the virtual address is either in the linear map
> or vmalloc, and not some other special place. You have no guarantee that
> the underlying memory even has an associated struct page at all.
>
>> When IOMMU is disabled in ARM64 architecture, we use virt_to_page()
>> before vmap(), it works. And when IOMMU is enabled using
>> virt_to_page() will cause calltrace later, we found the return
>> addr of dma_alloc_coherent is vmalloc addr, so we add the
>> condition judgement statement as below, it works.
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] =
>> is_vmalloc_addr(buf->page_list[i].buf) ?
>> vmalloc_to_page(buf->page_list[i].buf) :
>> virt_to_page(buf->page_list[i].buf);
>> Can you give us suggestion? better method?
> Oh my goodness, having now taken a closer look at this driver, I'm lost
> for words in disbelief. To pick just one example:
>
> u32 bits_per_long = BITS_PER_LONG;
> ...
> if (bits_per_long == 64) {
> /* memory mapping nonsense */
> }
>
> WTF does the size of a long have to do with DMA buffer management!?
>
> Of course I can guess that it might be trying to make some tortuous
> inference about vmalloc space being constrained on 32-bit platforms, but
> still...
>
>> The related code as below:
>> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>> GFP_KERNEL);
>> if (!buf->page_list)
>> return -ENOMEM;
>>
>> for (i = 0; i < buf->nbufs; ++i) {
>> buf->page_list[i].buf = dma_alloc_coherent(dev,
>> page_size, &t,
>> GFP_KERNEL);
>> if (!buf->page_list[i].buf)
>> goto err_free;
>>
>> buf->page_list[i].map = t;
>> memset(buf->page_list[i].buf, 0, page_size);
>> }
>>
>> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
>> GFP_KERNEL);
>> if (!pages)
>> goto err_free;
>>
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] =
>> is_vmalloc_addr(buf->page_list[i].buf) ?
>> vmalloc_to_page(buf->page_list[i].buf) :
>> virt_to_page(buf->page_list[i].buf);
>>
>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>> PAGE_KERNEL);
>> kfree(pages);
>> if (!buf->direct.buf)
>> goto err_free;
> OK, this is complete crap. As above, you cannot assume that a struct
> page even exists; even if it does you cannot assume that using a
> PAGE_KERNEL mapping will not result in mismatched attributes,
> unpredictable behaviour and data loss. Trying to remap coherent DMA
> allocations like this is just egregiously wrong.
>
> What I do like is that you can seemingly fix all this by simply deleting
> hns_roce_buf::direct and all the garbage code related to it, and using
> the page_list entries consistently because the alternate paths involving
> those appear to do the right thing already.
>
> That is, of course, assuming that the buffers involved can be so large
> that it's not practical to just always make a single allocation and
> fragment it into multiple descriptors if the hardware does have some
> maximum length constraint - frankly I'm a little puzzled by the
> PAGE_SIZE * 2 threshold, given that that's not a fixed size.
>
> Robin.
Hi, Robin
We restructured the code as below:
It replaces dma_alloc_coherent with the __get_free_pages and
dma_map_single functions. So, we can vmap several pointers returned by
__get_free_pages, right?
buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
GFP_KERNEL);
if (!buf->page_list)
return -ENOMEM;
for (i = 0; i < buf->nbufs; ++i) {
ptr = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
get_order(page_size));
if (!ptr) {
dev_err(dev, "Alloc pages error.\n");
goto err_free;
}
t = dma_map_single(dev, ptr, page_size,
DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, t)) {
dev_err(dev, "DMA mapping error.\n");
free_pages((unsigned long)ptr,
get_order(page_size));
goto err_free;
}
buf->page_list[i].buf = ptr;
buf->page_list[i].map = t;
}
pages = kmalloc_array(buf->nbufs, sizeof(*pages),
GFP_KERNEL);
if (!pages)
goto err_free;
for (i = 0; i < buf->nbufs; ++i)
pages[i] = virt_to_page(buf->page_list[i].buf);
buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
PAGE_KERNEL);
kfree(pages);
if (!buf->direct.buf)
goto err_free;
Regards
Wei Hu
>> Regards
>> Wei Hu
>>>> } else {
>>>> kfree(d_page);
>>>> d_page = NULL;
>>>> }
>>>> return d_page;
>>>> }
>>>>
>>>> Regards
>>>> Wei Hu
>>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>>>>> +++++++++++++++++++++++++++---
>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> index 3e4c525..a69cd4b 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>>>>> *hr_dev, u32 size, u32 max_direct,
>>>>>> goto err_free;
>>>>>>
>>>>>> for (i = 0; i < buf->nbufs; ++i)
>>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>>>>> + pages[i] =
>>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>>>>> + virt_to_page(buf->page_list[i].buf);
>>>>>>
>>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>>>> PAGE_KERNEL);
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> index 8388ae2..4a3d1d4 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> gfp_t gfp_mask)
>>>>>> {
>>>>>> struct hns_roce_hem_chunk *chunk = NULL;
>>>>>> + struct hns_roce_vmalloc *vmalloc;
>>>>>> struct hns_roce_hem *hem;
>>>>>> struct scatterlist *mem;
>>>>>> int order;
>>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>>>> chunk->npages = 0;
>>>>>> chunk->nsg = 0;
>>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>>>>> }
>>>>>>
>>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> if (!buf)
>>>>>> goto fail;
>>>>>>
>>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> + if (is_vmalloc_addr(buf)) {
>>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>>>>> + vmalloc->is_vmalloc_addr = true;
>>>>>> + vmalloc->vmalloc_addr = buf;
>>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>>>>> + PAGE_SIZE << order, offset_in_page(buf));
>>>>>> + } else {
>>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> + }
>>>>>> WARN_ON(mem->offset);
>>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>>>>
>>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>>>>> hns_roce_hem *hem)
>>>>>> {
>>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>>>>> + void *cpu_addr;
>>>>>> int i;
>>>>>>
>>>>>> if (!hem)
>>>>>> return;
>>>>>>
>>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>>>> - for (i = 0; i < chunk->npages; ++i)
>>>>>> + for (i = 0; i < chunk->npages; ++i) {
>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>>>> + else
>>>>>> + cpu_addr =
>>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>>>>> +
>>>>>> dma_free_coherent(hr_dev->dev,
>>>>>> chunk->mem[i].length,
>>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>>>> + cpu_addr,
>>>>>> sg_dma_address(&chunk->mem[i]));
>>>>>> + }
>>>>>> kfree(chunk);
>>>>>> }
>>>>>>
>>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>>>> *hr_dev,
>>>>>>
>>>>>> if (chunk->mem[i].length > (u32)offset) {
>>>>>> page = sg_page(&chunk->mem[i]);
>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>>>> + mutex_unlock(&table->mutex);
>>>>>> + return page ?
>>>>>> + chunk->vmalloc[i].vmalloc_addr
>>>>>> + + offset : NULL;
>>>>>> + }
>>>>>> goto out;
>>>>>> }
>>>>>> offset -= chunk->mem[i].length;
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> index af28bbf..62d712a 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> @@ -72,11 +72,17 @@ enum {
>>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>>>> };
>>>>>>
>>>>>> +struct hns_roce_vmalloc {
>>>>>> + bool is_vmalloc_addr;
>>>>>> + void *vmalloc_addr;
>>>>>> +};
>>>>>> +
>>>>>> struct hns_roce_hem_chunk {
>>>>>> struct list_head list;
>>>>>> int npages;
>>>>>> int nsg;
>>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>> };
>>>>>>
>>>>>> struct hns_roce_hem {
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> index b99d70a..9e19bf1 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>> {
>>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>>>> struct scatterlist *sg;
>>>>>> + u64 page_addr = 0;
>>>>>> u64 *pages;
>>>>>> + int i = 0, j = 0;
>>>>>> + int len = 0;
>>>>>> int entry;
>>>>>> - int i;
>>>>>>
>>>>>> mpt_entry = mb_buf;
>>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>>
>>>>>> i = 0;
>>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>>>> -
>>>>>> - /* Record the first 2 entry directly to MTPT table */
>>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> - break;
>>>>>> - i++;
>>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>>>> + for (j = 0; j < len; ++j) {
>>>>>> + page_addr = sg_dma_address(sg) +
>>>>>> + (j << mr->umem->page_shift);
>>>>>> + pages[i] = page_addr >> 6;
>>>>>> +
>>>>>> + /* Record the first 2 entry directly to MTPT table */
>>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> + goto found;
>>>>>> + i++;
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> +found:
>>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>>> _______________________________________________
>>>> iommu mailing list
>>>> iommu@lists.linux-foundation.org
>>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>> .
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> .
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
@ 2017-11-07 2:45 ` Wei Hu (Xavier)
0 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-11-07 2:45 UTC (permalink / raw)
To: Robin Murphy
Cc: Leon Romanovsky, shaobo.xu, xavier.huwei, lijun_nudt, oulijun,
linux-rdma, charles.chenxin, linuxarm, iommu, linux-kernel,
linux-mm, dledford, liuyixian, zhangxiping3, shaoboxu
On 2017/11/1 20:26, Robin Murphy wrote:
> On 01/11/17 07:46, Wei Hu (Xavier) wrote:
>>
>> On 2017/10/12 20:59, Robin Murphy wrote:
>>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>>>> If the IOMMU is enabled, the length of the sg obtained from
>>>>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>>>>> dma address, the IOVA will not be page-contiguous, and the VA
>>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>>>> the VA obtained by page_address is a discontinuous VA. Under
>>>>>> these circumstances, the IOVA should be calculated based on the
>>>>>> sg length, and the VA returned from dma_alloc_coherent recorded
>>>>>> in the hem struct.
>>>>>>
>>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>>>>>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>>>>>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>>>>>> ---
>>>>> Doug,
>>>>>
>>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>>>
>>>>> Thanks
>>>> Hi, Leon & Doug
>>>> We referred to the function named __ttm_dma_alloc_page in the
>>>> kernel code, as below. There are similar methods in the
>>>> bch_bio_map and mem_to_page functions in the current 4.14-rcX.
>>>>
>>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>>>> {
>>>> struct dma_page *d_page;
>>>>
>>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>>>> if (!d_page)
>>>> return NULL;
>>>>
>>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>>>> &d_page->dma,
>>>> pool->gfp_flags);
>>>> if (d_page->vaddr) {
>>>> if (is_vmalloc_addr(d_page->vaddr))
>>>> d_page->p = vmalloc_to_page(d_page->vaddr);
>>>> else
>>>> d_page->p = virt_to_page(d_page->vaddr);
>>> There are cases on various architectures where neither of those is
>>> right. Whether those actually intersect with TTM or RDMA use-cases is
>>> another matter, of course.
>>>
>>> What definitely is a problem is if you ever take that page and end up
>>> accessing it through any virtual address other than the one explicitly
>>> returned by dma_alloc_coherent(). That can blow the coherency wide open
>>> and invite data loss, right up to killing the whole system with a
>>> machine check on certain architectures.
>>>
>>> Robin.
>> Hi, Robin
>> Thanks for your comment.
>>
>> We have one problem, and the related code is below.
>> 1. Call dma_alloc_coherent several times to allocate memory.
>> 2. vmap the allocated memory pages.
>> 3. Software accesses the memory through the virtual address returned
>> by vmap, while hardware uses the DMA address from dma_alloc_coherent.
> The simple answer is "don't do that". Seriously. dma_alloc_coherent()
> gives you a CPU virtual address and a DMA address with which to access
> your buffer, and that is the limit of what you may infer about it. You
> have no guarantee that the virtual address is either in the linear map
> or vmalloc, and not some other special place. You have no guarantee that
> the underlying memory even has an associated struct page at all.
>
>> When the IOMMU is disabled on the ARM64 architecture, using
>> virt_to_page() before vmap() works. When the IOMMU is enabled,
>> virt_to_page() causes a call trace later; we found that the address
>> returned by dma_alloc_coherent is a vmalloc address, so we added the
>> conditional check below, and it works.
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] =
>> is_vmalloc_addr(buf->page_list[i].buf) ?
>> vmalloc_to_page(buf->page_list[i].buf) :
>> virt_to_page(buf->page_list[i].buf);
>> Can you give us a suggestion, or a better method?
> Oh my goodness, having now taken a closer look at this driver, I'm lost
> for words in disbelief. To pick just one example:
>
> u32 bits_per_long = BITS_PER_LONG;
> ...
> if (bits_per_long == 64) {
> /* memory mapping nonsense */
> }
>
> WTF does the size of a long have to do with DMA buffer management!?
>
> Of course I can guess that it might be trying to make some tortuous
> inference about vmalloc space being constrained on 32-bit platforms, but
> still...
>
>> The related code as below:
>> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>> GFP_KERNEL);
>> if (!buf->page_list)
>> return -ENOMEM;
>>
>> for (i = 0; i < buf->nbufs; ++i) {
>> buf->page_list[i].buf = dma_alloc_coherent(dev,
>> page_size, &t,
>> GFP_KERNEL);
>> if (!buf->page_list[i].buf)
>> goto err_free;
>>
>> buf->page_list[i].map = t;
>> memset(buf->page_list[i].buf, 0, page_size);
>> }
>>
>> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
>> GFP_KERNEL);
>> if (!pages)
>> goto err_free;
>>
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] =
>> is_vmalloc_addr(buf->page_list[i].buf) ?
>> vmalloc_to_page(buf->page_list[i].buf) :
>> virt_to_page(buf->page_list[i].buf);
>>
>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>> PAGE_KERNEL);
>> kfree(pages);
>> if (!buf->direct.buf)
>> goto err_free;
> OK, this is complete crap. As above, you cannot assume that a struct
> page even exists; even if it does you cannot assume that using a
> PAGE_KERNEL mapping will not result in mismatched attributes,
> unpredictable behaviour and data loss. Trying to remap coherent DMA
> allocations like this is just egregiously wrong.
>
> What I do like is that you can seemingly fix all this by simply deleting
> hns_roce_buf::direct and all the garbage code related to it, and using
> the page_list entries consistently because the alternate paths involving
> those appear to do the right thing already.
>
> That is, of course, assuming that the buffers involved can be so large
> that it's not practical to just always make a single allocation and
> fragment it into multiple descriptors if the hardware does have some
> maximum length constraint - frankly I'm a little puzzled by the
> PAGE_SIZE * 2 threshold, given that that's not a fixed size.
>
> Robin.
Hi, Robin
We restructured the code as below:
It replaces dma_alloc_coherent with the __get_free_pages and
dma_map_single functions. So, we can vmap several pointers returned by
__get_free_pages, right?
buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
GFP_KERNEL);
if (!buf->page_list)
return -ENOMEM;
for (i = 0; i < buf->nbufs; ++i) {
ptr = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
get_order(page_size));
if (!ptr) {
dev_err(dev, "Alloc pages error.\n");
goto err_free;
}
t = dma_map_single(dev, ptr, page_size,
DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, t)) {
dev_err(dev, "DMA mapping error.\n");
free_pages((unsigned long)ptr,
get_order(page_size));
goto err_free;
}
buf->page_list[i].buf = ptr;
buf->page_list[i].map = t;
}
pages = kmalloc_array(buf->nbufs, sizeof(*pages),
GFP_KERNEL);
if (!pages)
goto err_free;
for (i = 0; i < buf->nbufs; ++i)
pages[i] = virt_to_page(buf->page_list[i].buf);
buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
PAGE_KERNEL);
kfree(pages);
if (!buf->direct.buf)
goto err_free;
Regards
Wei Hu
>> Regards
>> Wei Hu
>>>> } else {
>>>> kfree(d_page);
>>>> d_page = NULL;
>>>> }
>>>> return d_page;
>>>> }
>>>>
>>>> Regards
>>>> Wei Hu
>>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>>>>> +++++++++++++++++++++++++++---
>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> index 3e4c525..a69cd4b 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>>>>> *hr_dev, u32 size, u32 max_direct,
>>>>>> goto err_free;
>>>>>>
>>>>>> for (i = 0; i < buf->nbufs; ++i)
>>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>>>>> + pages[i] =
>>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>>>>> + virt_to_page(buf->page_list[i].buf);
>>>>>>
>>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>>>> PAGE_KERNEL);
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> index 8388ae2..4a3d1d4 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> gfp_t gfp_mask)
>>>>>> {
>>>>>> struct hns_roce_hem_chunk *chunk = NULL;
>>>>>> + struct hns_roce_vmalloc *vmalloc;
>>>>>> struct hns_roce_hem *hem;
>>>>>> struct scatterlist *mem;
>>>>>> int order;
>>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>>>> chunk->npages = 0;
>>>>>> chunk->nsg = 0;
>>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>>>>> }
>>>>>>
>>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> if (!buf)
>>>>>> goto fail;
>>>>>>
>>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> + if (is_vmalloc_addr(buf)) {
>>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>>>>> + vmalloc->is_vmalloc_addr = true;
>>>>>> + vmalloc->vmalloc_addr = buf;
>>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>>>>> + PAGE_SIZE << order, offset_in_page(buf));
>>>>>> + } else {
>>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> + }
>>>>>> WARN_ON(mem->offset);
>>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>>>>
>>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>>>>> hns_roce_hem *hem)
>>>>>> {
>>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>>>>> + void *cpu_addr;
>>>>>> int i;
>>>>>>
>>>>>> if (!hem)
>>>>>> return;
>>>>>>
>>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>>>> - for (i = 0; i < chunk->npages; ++i)
>>>>>> + for (i = 0; i < chunk->npages; ++i) {
>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>>>> + else
>>>>>> + cpu_addr =
>>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>>>>> +
>>>>>> dma_free_coherent(hr_dev->dev,
>>>>>> chunk->mem[i].length,
>>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>>>> + cpu_addr,
>>>>>> sg_dma_address(&chunk->mem[i]));
>>>>>> + }
>>>>>> kfree(chunk);
>>>>>> }
>>>>>>
>>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>>>> *hr_dev,
>>>>>>
>>>>>> if (chunk->mem[i].length > (u32)offset) {
>>>>>> page = sg_page(&chunk->mem[i]);
>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>>>> + mutex_unlock(&table->mutex);
>>>>>> + return page ?
>>>>>> + chunk->vmalloc[i].vmalloc_addr
>>>>>> + + offset : NULL;
>>>>>> + }
>>>>>> goto out;
>>>>>> }
>>>>>> offset -= chunk->mem[i].length;
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> index af28bbf..62d712a 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> @@ -72,11 +72,17 @@ enum {
>>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>>>> };
>>>>>>
>>>>>> +struct hns_roce_vmalloc {
>>>>>> + bool is_vmalloc_addr;
>>>>>> + void *vmalloc_addr;
>>>>>> +};
>>>>>> +
>>>>>> struct hns_roce_hem_chunk {
>>>>>> struct list_head list;
>>>>>> int npages;
>>>>>> int nsg;
>>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>> };
>>>>>>
>>>>>> struct hns_roce_hem {
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> index b99d70a..9e19bf1 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>> {
>>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>>>> struct scatterlist *sg;
>>>>>> + u64 page_addr = 0;
>>>>>> u64 *pages;
>>>>>> + int i = 0, j = 0;
>>>>>> + int len = 0;
>>>>>> int entry;
>>>>>> - int i;
>>>>>>
>>>>>> mpt_entry = mb_buf;
>>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>>
>>>>>> i = 0;
>>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>>>> -
>>>>>> - /* Record the first 2 entry directly to MTPT table */
>>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> - break;
>>>>>> - i++;
>>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>>>> + for (j = 0; j < len; ++j) {
>>>>>> + page_addr = sg_dma_address(sg) +
>>>>>> + (j << mr->umem->page_shift);
>>>>>> + pages[i] = page_addr >> 6;
>>>>>> +
>>>>>> + /* Record the first 2 entry directly to MTPT table */
>>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> + goto found;
>>>>>> + i++;
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> +found:
>>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-11-07 2:45 ` Wei Hu (Xavier)
@ 2017-11-07 6:32 ` Leon Romanovsky
-1 siblings, 0 replies; 57+ messages in thread
From: Leon Romanovsky @ 2017-11-07 6:32 UTC (permalink / raw)
To: Wei Hu (Xavier)
Cc: shaobo.xu-ral2JQCrhuEAvxtiuMwx3w, xavier.huwei-WVlzvzqoTvw,
lijun_nudt-9Onoh4P/yGk, oulijun-hv44wF8Li93QT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, shaoboxu-WVlzvzqoTvw,
dledford-H+wXaHxf7aLQT0dZR+AlfA
On Tue, Nov 07, 2017 at 10:45:29AM +0800, Wei Hu (Xavier) wrote:
>
>
> On 2017/11/1 20:26, Robin Murphy wrote:
> > On 01/11/17 07:46, Wei Hu (Xavier) wrote:
> >>
> >> On 2017/10/12 20:59, Robin Murphy wrote:
> >>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
> >>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
> >>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
> >>>>>> If the IOMMU is enabled, the length of the sg obtained from
> >>>>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
> >>>>>> dma address, the IOVA will not be page-contiguous, and the VA
> >>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
> >>>>>> the VA obtained by page_address is a discontinuous VA. Under
> >>>>>> these circumstances, the IOVA should be calculated based on the
> >>>>>> sg length, and the VA returned from dma_alloc_coherent recorded
> >>>>>> in the hem struct.
> >>>>>>
> >>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> >>>>>> Signed-off-by: Shaobo Xu <xushaobo2-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> >>>>>> Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> >>>>>> ---
> >>>>> Doug,
> >>>>>
> >>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
> >>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
> >>>>>
> >>>>> Thanks
> >>>> Hi, Leon & Doug
> >>>> We referred to the function named __ttm_dma_alloc_page in the
> >>>> kernel code, as below. There are similar methods in the
> >>>> bch_bio_map and mem_to_page functions in the current 4.14-rcX.
> >>>>
> >>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
> >>>> {
> >>>> struct dma_page *d_page;
> >>>>
> >>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
> >>>> if (!d_page)
> >>>> return NULL;
> >>>>
> >>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
> >>>> &d_page->dma,
> >>>> pool->gfp_flags);
> >>>> if (d_page->vaddr) {
> >>>> if (is_vmalloc_addr(d_page->vaddr))
> >>>> d_page->p = vmalloc_to_page(d_page->vaddr);
> >>>> else
> >>>> d_page->p = virt_to_page(d_page->vaddr);
> >>> There are cases on various architectures where neither of those is
> >>> right. Whether those actually intersect with TTM or RDMA use-cases is
> >>> another matter, of course.
> >>>
> >>> What definitely is a problem is if you ever take that page and end up
> >>> accessing it through any virtual address other than the one explicitly
> >>> returned by dma_alloc_coherent(). That can blow the coherency wide open
> >>> and invite data loss, right up to killing the whole system with a
> >>> machine check on certain architectures.
> >>>
> >>> Robin.
> >> Hi, Robin
> >> Thanks for your comment.
> >>
> >> We have one problem, and the related code is below.
> >> 1. Call dma_alloc_coherent several times to allocate memory.
> >> 2. vmap the allocated memory pages.
> >> 3. Software accesses the memory through the virtual address returned
> >> by vmap, while hardware uses the DMA address from dma_alloc_coherent.
> > The simple answer is "don't do that". Seriously. dma_alloc_coherent()
> > gives you a CPU virtual address and a DMA address with which to access
> > your buffer, and that is the limit of what you may infer about it. You
> > have no guarantee that the virtual address is either in the linear map
> > or vmalloc, and not some other special place. You have no guarantee that
> > the underlying memory even has an associated struct page at all.
> >
> >> When the IOMMU is disabled on the ARM64 architecture, using
> >> virt_to_page() before vmap() works. When the IOMMU is enabled,
> >> virt_to_page() causes a call trace later; we found that the address
> >> returned by dma_alloc_coherent is a vmalloc address, so we added the
> >> conditional check below, and it works.
> >> for (i = 0; i < buf->nbufs; ++i)
> >> pages[i] =
> >> is_vmalloc_addr(buf->page_list[i].buf) ?
> >> vmalloc_to_page(buf->page_list[i].buf) :
> >> virt_to_page(buf->page_list[i].buf);
> >> Can you give us a suggestion, or a better method?
> > Oh my goodness, having now taken a closer look at this driver, I'm lost
> > for words in disbelief. To pick just one example:
> >
> > u32 bits_per_long = BITS_PER_LONG;
> > ...
> > if (bits_per_long == 64) {
> > /* memory mapping nonsense */
> > }
> >
> > WTF does the size of a long have to do with DMA buffer management!?
> >
> > Of course I can guess that it might be trying to make some tortuous
> > inference about vmalloc space being constrained on 32-bit platforms, but
> > still...
> >
> >> The related code as below:
> >> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
> >> GFP_KERNEL);
> >> if (!buf->page_list)
> >> return -ENOMEM;
> >>
> >> for (i = 0; i < buf->nbufs; ++i) {
> >> buf->page_list[i].buf = dma_alloc_coherent(dev,
> >> page_size, &t,
> >> GFP_KERNEL);
> >> if (!buf->page_list[i].buf)
> >> goto err_free;
> >>
> >> buf->page_list[i].map = t;
> >> memset(buf->page_list[i].buf, 0, page_size);
> >> }
> >>
> >> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
> >> GFP_KERNEL);
> >> if (!pages)
> >> goto err_free;
> >>
> >> for (i = 0; i < buf->nbufs; ++i)
> >> pages[i] =
> >> is_vmalloc_addr(buf->page_list[i].buf) ?
> >> vmalloc_to_page(buf->page_list[i].buf) :
> >> virt_to_page(buf->page_list[i].buf);
> >>
> >> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
> >> PAGE_KERNEL);
> >> kfree(pages);
> >> if (!buf->direct.buf)
> >> goto err_free;
> > OK, this is complete crap. As above, you cannot assume that a struct
> > page even exists; even if it does you cannot assume that using a
> > PAGE_KERNEL mapping will not result in mismatched attributes,
> > unpredictable behaviour and data loss. Trying to remap coherent DMA
> > allocations like this is just egregiously wrong.
> >
> > What I do like is that you can seemingly fix all this by simply deleting
> > hns_roce_buf::direct and all the garbage code related to it, and using
> > the page_list entries consistently because the alternate paths involving
> > those appear to do the right thing already.
> >
> > That is, of course, assuming that the buffers involved can be so large
> > that it's not practical to just always make a single allocation and
> > fragment it into multiple descriptors if the hardware does have some
> > maximum length constraint - frankly I'm a little puzzled by the
> > PAGE_SIZE * 2 threshold, given that that's not a fixed size.
> >
> > Robin.
> Hi,Robin
>
> We reconstruct the code as below:
> It replaces dma_alloc_coherent with __get_free_pages and
> dma_map_single
> functions. So, we can vmap several ptrs returned by
> __get_free_pages, right?
Most probably not, you should get rid of your virt_to_page/vmap calls.
Thanks
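Taken together, the suggestion amounts to dropping the flat `direct` mapping and addressing the buffer fragment by fragment through the page_list entries. A minimal sketch of what that could look like (a hypothetical helper, not code from the driver; the field names are assumed from the snippets quoted in this thread):

```c
/*
 * Hypothetical sketch: resolve an offset into the fragment that
 * contains it, using the page_list entries directly instead of a
 * vmap()'d alias of dma_alloc_coherent() memory.
 */
static void *hns_roce_buf_offset(struct hns_roce_buf *buf, u32 offset,
				 u32 page_size)
{
	/* page_list[i].buf is the CPU address dma_alloc_coherent()
	 * returned for fragment i; it is the only CPU virtual address
	 * that may safely be used to touch that fragment. */
	return (char *)buf->page_list[offset / page_size].buf +
	       (offset % page_size);
}
```

With a helper like this, hns_roce_buf::direct and the vmap()/virt_to_page() machinery become unnecessary, which is exactly the fix Robin proposed earlier in the thread.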
>
>
> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
> GFP_KERNEL);
> if (!buf->page_list)
> return -ENOMEM;
>
> for (i = 0; i < buf->nbufs; ++i) {
> ptr = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> get_order(page_size));
> if (!ptr) {
> dev_err(dev, "Alloc pages error.\n");
> goto err_free;
> }
>
> t = dma_map_single(dev, ptr, page_size,
> DMA_BIDIRECTIONAL);
> if (dma_mapping_error(dev, t)) {
> dev_err(dev, "DMA mapping error.\n");
> free_pages((unsigned long)ptr,
> get_order(page_size));
> goto err_free;
> }
>
> buf->page_list[i].buf = ptr;
> buf->page_list[i].map = t;
> }
>
> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
> GFP_KERNEL);
> if (!pages)
> goto err_free;
>
> for (i = 0; i < buf->nbufs; ++i)
> pages[i] = virt_to_page(buf->page_list[i].buf);
>
> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
> PAGE_KERNEL);
> kfree(pages);
> if (!buf->direct.buf)
> goto err_free;
>
>
> Regards
> Wei Hu
> >> Regards
> >> Wei Hu
> >>>> } else {
> >>>> kfree(d_page);
> >>>> d_page = NULL;
> >>>> }
> >>>> return d_page;
> >>>> }
> >>>>
> >>>> Regards
> >>>> Wei Hu
> >>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
> >>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
> >>>>>> +++++++++++++++++++++++++++---
> >>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
> >>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
> >>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
> >>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
> >>>>>> index 3e4c525..a69cd4b 100644
> >>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
> >>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
> >>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
> >>>>>> *hr_dev, u32 size, u32 max_direct,
> >>>>>> goto err_free;
> >>>>>>
> >>>>>> for (i = 0; i < buf->nbufs; ++i)
> >>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
> >>>>>> + pages[i] =
> >>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
> >>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
> >>>>>> + virt_to_page(buf->page_list[i].buf);
> >>>>>>
> >>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
> >>>>>> PAGE_KERNEL);
> >>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
> >>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
> >>>>>> index 8388ae2..4a3d1d4 100644
> >>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
> >>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
> >>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
> >>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> >>>>>> gfp_t gfp_mask)
> >>>>>> {
> >>>>>> struct hns_roce_hem_chunk *chunk = NULL;
> >>>>>> + struct hns_roce_vmalloc *vmalloc;
> >>>>>> struct hns_roce_hem *hem;
> >>>>>> struct scatterlist *mem;
> >>>>>> int order;
> >>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
> >>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> >>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
> >>>>>> chunk->npages = 0;
> >>>>>> chunk->nsg = 0;
> >>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
> >>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
> >>>>>> }
> >>>>>>
> >>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
> >>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> >>>>>> if (!buf)
> >>>>>> goto fail;
> >>>>>>
> >>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
> >>>>>> + if (is_vmalloc_addr(buf)) {
> >>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
> >>>>>> + vmalloc->is_vmalloc_addr = true;
> >>>>>> + vmalloc->vmalloc_addr = buf;
> >>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
> >>>>>> + PAGE_SIZE << order, offset_in_page(buf));
> >>>>>> + } else {
> >>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
> >>>>>> + }
> >>>>>> WARN_ON(mem->offset);
> >>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
> >>>>>>
> >>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
> >>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> >>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
> >>>>>> hns_roce_hem *hem)
> >>>>>> {
> >>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
> >>>>>> + void *cpu_addr;
> >>>>>> int i;
> >>>>>>
> >>>>>> if (!hem)
> >>>>>> return;
> >>>>>>
> >>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
> >>>>>> - for (i = 0; i < chunk->npages; ++i)
> >>>>>> + for (i = 0; i < chunk->npages; ++i) {
> >>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
> >>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
> >>>>>> + else
> >>>>>> + cpu_addr =
> >>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
> >>>>>> +
> >>>>>> dma_free_coherent(hr_dev->dev,
> >>>>>> chunk->mem[i].length,
> >>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
> >>>>>> + cpu_addr,
> >>>>>> sg_dma_address(&chunk->mem[i]));
> >>>>>> + }
> >>>>>> kfree(chunk);
> >>>>>> }
> >>>>>>
> >>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
> >>>>>> *hr_dev,
> >>>>>>
> >>>>>> if (chunk->mem[i].length > (u32)offset) {
> >>>>>> page = sg_page(&chunk->mem[i]);
> >>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
> >>>>>> + mutex_unlock(&table->mutex);
> >>>>>> + return page ?
> >>>>>> + chunk->vmalloc[i].vmalloc_addr
> >>>>>> + + offset : NULL;
> >>>>>> + }
> >>>>>> goto out;
> >>>>>> }
> >>>>>> offset -= chunk->mem[i].length;
> >>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
> >>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
> >>>>>> index af28bbf..62d712a 100644
> >>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
> >>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
> >>>>>> @@ -72,11 +72,17 @@ enum {
> >>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
> >>>>>> };
> >>>>>>
> >>>>>> +struct hns_roce_vmalloc {
> >>>>>> + bool is_vmalloc_addr;
> >>>>>> + void *vmalloc_addr;
> >>>>>> +};
> >>>>>> +
> >>>>>> struct hns_roce_hem_chunk {
> >>>>>> struct list_head list;
> >>>>>> int npages;
> >>>>>> int nsg;
> >>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
> >>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
> >>>>>> };
> >>>>>>
> >>>>>> struct hns_roce_hem {
> >>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> >>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> >>>>>> index b99d70a..9e19bf1 100644
> >>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> >>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> >>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
> >>>>>> *mb_buf, struct hns_roce_mr *mr,
> >>>>>> {
> >>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
> >>>>>> struct scatterlist *sg;
> >>>>>> + u64 page_addr = 0;
> >>>>>> u64 *pages;
> >>>>>> + int i = 0, j = 0;
> >>>>>> + int len = 0;
> >>>>>> int entry;
> >>>>>> - int i;
> >>>>>>
> >>>>>> mpt_entry = mb_buf;
> >>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
> >>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
> >>>>>> *mb_buf, struct hns_roce_mr *mr,
> >>>>>>
> >>>>>> i = 0;
> >>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
> >>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
> >>>>>> -
> >>>>>> - /* Record the first 2 entry directly to MTPT table */
> >>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
> >>>>>> - break;
> >>>>>> - i++;
> >>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
> >>>>>> + for (j = 0; j < len; ++j) {
> >>>>>> + page_addr = sg_dma_address(sg) +
> >>>>>> + (j << mr->umem->page_shift);
> >>>>>> + pages[i] = page_addr >> 6;
> >>>>>> +
> >>>>>> + /* Record the first 2 entry directly to MTPT table */
> >>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
> >>>>>> + goto found;
> >>>>>> + i++;
> >>>>>> + }
> >>>>>> }
> >>>>>>
> >>>>>> +found:
> >>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
> >>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
> >>>>>> V2_MPT_BYTE_56_PA0_H_S,
> >>>>>> --
> >>>>>> 1.9.1
> >>>>>>
> >>>> _______________________________________________
> >>>> iommu mailing list
> >>>> iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> >>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> >>> .
> >>>
> >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> > .
> >
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-11-07 2:45 ` Wei Hu (Xavier)
@ 2017-11-07 15:48 ` Jason Gunthorpe
-1 siblings, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2017-11-07 15:48 UTC (permalink / raw)
To: Wei Hu (Xavier)
Cc: Robin Murphy, Leon Romanovsky, shaobo.xu, xavier.huwei,
lijun_nudt, oulijun, linux-rdma, charles.chenxin, linuxarm,
iommu, linux-kernel, linux-mm, dledford, liuyixian, zhangxiping3,
shaoboxu
On Tue, Nov 07, 2017 at 10:45:29AM +0800, Wei Hu (Xavier) wrote:
> We reconstructed the code as below:
> It replaces dma_alloc_coherent with __get_free_pages and
> dma_map_single functions. So, we can vmap several ptrs returned by
> __get_free_pages, right?
Can't you just use vmalloc and dma_map that? Other drivers follow that
approach..
However, dma_alloc_coherent and dma_map_single are not the same
thing. You can't touch the vmap memory once you call dma_map unless
the driver also includes dma cache flushing calls in all the right
places.
The difference is that alloc_coherent will return non-cachable memory
if necessary, while get_free_pages does not.
Jason
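The ownership rules Jason describes are the standard streaming-DMA pattern; roughly the following (an illustrative fragment with placeholder variables, not hns code — the CPU may only touch the buffer on its side of an explicit ownership transfer):

```c
/* Illustrative streaming-DMA pattern for a dma_map_single() buffer. */
dma_addr_t handle;

handle = dma_map_single(dev, ptr, size, DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, handle))
	return -ENOMEM;

/* CPU writes, then hands the buffer off to the device. */
memset(ptr, 0, size);
dma_sync_single_for_device(dev, handle, size, DMA_BIDIRECTIONAL);

/* ... device DMA runs here ... */

/* Take the buffer back before the CPU reads what the device wrote. */
dma_sync_single_for_cpu(dev, handle, size, DMA_BIDIRECTIONAL);
```

dma_alloc_coherent() makes these sync calls unnecessary precisely because it may hand back non-cacheable memory, which is the distinction Jason draws above.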
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-11-07 15:48 ` Jason Gunthorpe
(?)
@ 2017-11-07 15:58 ` Christoph Hellwig
-1 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2017-11-07 15:58 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Wei Hu (Xavier),
Robin Murphy, Leon Romanovsky, shaobo.xu-ral2JQCrhuEAvxtiuMwx3w,
xavier.huwei-WVlzvzqoTvw, lijun_nudt-9Onoh4P/yGk,
oulijun-hv44wF8Li93QT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, dledford-H+wXaHxf7aLQT0dZR+AlfA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, shaoboxu-WVlzvzqoTvw
On Tue, Nov 07, 2017 at 08:48:38AM -0700, Jason Gunthorpe wrote:
> Can't you just use vmalloc and dma_map that? Other drivers follow that
> approach..
You can't easily due to the flushing requirements. We used to do that
in XFS and it led to problems. You need the page allocator + vmap +
invalidate_kernel_vmap_range + flush_kernel_vmap_range to get the
cache flushing right.
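For reference, the sequence Christoph outlines looks roughly like this (a sketch only, with assumed variables and error handling omitted):

```c
/* Sketch: page allocator + vmap with explicit alias maintenance. */
void *addr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);

/* CPU dirtied the buffer through the vmap alias: write it back
 * before the device reads. */
flush_kernel_vmap_range(addr, npages * PAGE_SIZE);

/* ... dma_map_sg() / device DMA ... */

/* Device wrote the buffer: discard stale alias lines before the
 * CPU reads through the vmap address. */
invalidate_kernel_vmap_range(addr, npages * PAGE_SIZE);
```

These two calls keep the vmap alias coherent with the linear-map addresses the DMA API operates on, which is what plain vmalloc + dma_map misses.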
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-11-07 15:58 ` Christoph Hellwig
(?)
@ 2017-11-07 16:03 ` Jason Gunthorpe
-1 siblings, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2017-11-07 16:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Wei Hu (Xavier),
shaobo.xu-ral2JQCrhuEAvxtiuMwx3w, xavier.huwei-WVlzvzqoTvw,
lijun_nudt-9Onoh4P/yGk, oulijun-hv44wF8Li93QT0dZR+AlfA,
Leon Romanovsky, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, shaoboxu-WVlzvzqoTvw,
dledford-H+wXaHxf7aLQT0dZR+AlfA
On Tue, Nov 07, 2017 at 07:58:05AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 07, 2017 at 08:48:38AM -0700, Jason Gunthorpe wrote:
> > Can't you just use vmalloc and dma_map that? Other drivers follow that
> > approach..
>
> You can't easily due to the flushing requirements. We used to do that
> in XFS and it led to problems. You need the page allocator + vmap +
> invalidate_kernel_vmap_range + flush_kernel_vmap_range to get the
> cache flushing right.
Yes, exactly something ugly like that.. :\
Jason
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-11-07 6:32 ` Leon Romanovsky
@ 2017-11-09 1:17 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-11-09 1:17 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Robin Murphy, shaobo.xu-ral2JQCrhuEAvxtiuMwx3w,
xavier.huwei-WVlzvzqoTvw, lijun_nudt-9Onoh4P/yGk,
oulijun-hv44wF8Li93QT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, dledford-H+wXaHxf7aLQT0dZR+AlfA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, shaoboxu-WVlzvzqoTvw
On 2017/11/7 14:32, Leon Romanovsky wrote:
> On Tue, Nov 07, 2017 at 10:45:29AM +0800, Wei Hu (Xavier) wrote:
>>
>> On 2017/11/1 20:26, Robin Murphy wrote:
>>> On 01/11/17 07:46, Wei Hu (Xavier) wrote:
>>>> On 2017/10/12 20:59, Robin Murphy wrote:
>>>>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>>>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>>>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>>>>>> If the IOMMU is enabled, the length of sg obtained from
>>>>>>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>>>>>>> dma address, the IOVA will not be page-contiguous, and the VA
>>>>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>>>>>> the VA obtained by the page_address is a discontinuous VA. Under
>>>>>>>> these circumstances, the IOVA should be calculated based on the
>>>>>>>> sg length, and record the VA returned from dma_alloc_coherent
>>>>>>>> in the struct of hem.
>>>>>>>>
>>>>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>>>>>>> Signed-off-by: Shaobo Xu <xushaobo2-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>>>>>>> Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>>>>>>>> ---
>>>>>>> Doug,
>>>>>>>
>>>>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>>>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>>>>>
>>>>>>> Thanks
>>>>>> Hi, Leon & Doug
>>>>>> We referred to the function named __ttm_dma_alloc_page in the kernel
>>>>>> code as below:
>>>>>> And there are similar methods in bch_bio_map and mem_to_page
>>>>>> functions in current 4.14-rcx.
>>>>>>
>>>>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>>>>>> {
>>>>>> struct dma_page *d_page;
>>>>>>
>>>>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>>>>>> if (!d_page)
>>>>>> return NULL;
>>>>>>
>>>>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>>>>>> &d_page->dma,
>>>>>> pool->gfp_flags);
>>>>>> if (d_page->vaddr) {
>>>>>> if (is_vmalloc_addr(d_page->vaddr))
>>>>>> d_page->p = vmalloc_to_page(d_page->vaddr);
>>>>>> else
>>>>>> d_page->p = virt_to_page(d_page->vaddr);
>>>>> There are cases on various architectures where neither of those is
>>>>> right. Whether those actually intersect with TTM or RDMA use-cases is
>>>>> another matter, of course.
>>>>>
>>>>> What definitely is a problem is if you ever take that page and end up
>>>>> accessing it through any virtual address other than the one explicitly
>>>>> returned by dma_alloc_coherent(). That can blow the coherency wide open
>>>>> and invite data loss, right up to killing the whole system with a
>>>>> machine check on certain architectures.
>>>>>
>>>>> Robin.
>>>> Hi, Robin
>>>> Thanks for your comment.
>>>>
>>>> We have one problem and the related code as below.
>>>> 1. Call the dma_alloc_coherent function several times to alloc memory.
>>>> 2. vmap the allocated memory pages.
>>>> 3. Software accesses the memory using the virt addr returned by vmap,
>>>> and hardware uses the dma addr from dma_alloc_coherent.
>>> The simple answer is "don't do that". Seriously. dma_alloc_coherent()
>>> gives you a CPU virtual address and a DMA address with which to access
>>> your buffer, and that is the limit of what you may infer about it. You
>>> have no guarantee that the virtual address is either in the linear map
>>> or vmalloc, and not some other special place. You have no guarantee that
>>> the underlying memory even has an associated struct page at all.
>>>
>>>> When the IOMMU is disabled on the ARM64 architecture, we use
>>>> virt_to_page() before vmap() and it works. When the IOMMU is enabled,
>>>> using virt_to_page() causes a calltrace later; we found the addr
>>>> returned by dma_alloc_coherent is a vmalloc addr, so we added the
>>>> conditional statement below, and it works.
>>>> for (i = 0; i < buf->nbufs; ++i)
>>>> pages[i] =
>>>> is_vmalloc_addr(buf->page_list[i].buf) ?
>>>> vmalloc_to_page(buf->page_list[i].buf) :
>>>> virt_to_page(buf->page_list[i].buf);
>>>> Can you give us suggestion? better method?
>>> Oh my goodness, having now taken a closer look at this driver, I'm lost
>>> for words in disbelief. To pick just one example:
>>>
>>> u32 bits_per_long = BITS_PER_LONG;
>>> ...
>>> if (bits_per_long == 64) {
>>> /* memory mapping nonsense */
>>> }
>>>
>>> WTF does the size of a long have to do with DMA buffer management!?
>>>
>>> Of course I can guess that it might be trying to make some tortuous
>>> inference about vmalloc space being constrained on 32-bit platforms, but
>>> still...
>>>
>>>> The related code as below:
>>>> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>>>> GFP_KERNEL);
>>>> if (!buf->page_list)
>>>> return -ENOMEM;
>>>>
>>>> for (i = 0; i < buf->nbufs; ++i) {
>>>> buf->page_list[i].buf = dma_alloc_coherent(dev,
>>>> page_size, &t,
>>>> GFP_KERNEL);
>>>> if (!buf->page_list[i].buf)
>>>> goto err_free;
>>>>
>>>> buf->page_list[i].map = t;
>>>> memset(buf->page_list[i].buf, 0, page_size);
>>>> }
>>>>
>>>> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
>>>> GFP_KERNEL);
>>>> if (!pages)
>>>> goto err_free;
>>>>
>>>> for (i = 0; i < buf->nbufs; ++i)
>>>> pages[i] =
>>>> is_vmalloc_addr(buf->page_list[i].buf) ?
>>>> vmalloc_to_page(buf->page_list[i].buf) :
>>>> virt_to_page(buf->page_list[i].buf);
>>>>
>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>> PAGE_KERNEL);
>>>> kfree(pages);
>>>> if (!buf->direct.buf)
>>>> goto err_free;
>>> OK, this is complete crap. As above, you cannot assume that a struct
>>> page even exists; even if it does you cannot assume that using a
>>> PAGE_KERNEL mapping will not result in mismatched attributes,
>>> unpredictable behaviour and data loss. Trying to remap coherent DMA
>>> allocations like this is just egregiously wrong.
>>>
>>> What I do like is that you can seemingly fix all this by simply deleting
>>> hns_roce_buf::direct and all the garbage code related to it, and using
>>> the page_list entries consistently because the alternate paths involving
>>> those appear to do the right thing already.
>>>
>>> That is, of course, assuming that the buffers involved can be so large
>>> that it's not practical to just always make a single allocation and
>>> fragment it into multiple descriptors if the hardware does have some
>>> maximum length constraint - frankly I'm a little puzzled by the
>>> PAGE_SIZE * 2 threshold, given that that's not a fixed size.
>>>
>>> Robin.
>> Hi, Robin
>>
>> We reconstructed the code as below:
>> It replaces dma_alloc_coherent with the __get_free_pages and
>> dma_map_single functions. So, we can vmap several ptrs returned by
>> __get_free_pages, right?
> Most probably not, you should get rid of your virt_to_page/vmap calls.
>
> Thanks
Hi, Leon
Thanks for your suggestion.
I will send a patch to fix it.
Regards
Wei Hu
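
For reference, a direction consistent with Leon's suggestion — keeping the per-chunk allocations and dropping the vmap() alias entirely, so each chunk is accessed only through its own kernel address — might look roughly like the following. This is a hypothetical sketch, not the eventual patch; the hns structures are simplified and error handling is elided.

```c
/* Allocate and map each chunk; the CPU touches chunk i only through
 * page_list[i].buf, never through a second vmap()ed alias. */
for (i = 0; i < buf->nbufs; ++i) {
	ptr = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
				       get_order(page_size));
	t = dma_map_single(dev, ptr, page_size, DMA_BIDIRECTIONAL);
	buf->page_list[i].buf = ptr;
	buf->page_list[i].map = t;
}

/* After the CPU writes chunk i and before the device reads it: */
dma_sync_single_for_device(dev, buf->page_list[i].map,
			   page_size, DMA_TO_DEVICE);

/* After the device writes chunk i and before the CPU reads it: */
dma_sync_single_for_cpu(dev, buf->page_list[i].map,
			page_size, DMA_FROM_DEVICE);
```

With streaming mappings the dma_sync_single_for_* calls take the place of the cache maintenance that a vmap alias would otherwise silently skip.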
>>
>> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>> GFP_KERNEL);
>> if (!buf->page_list)
>> return -ENOMEM;
>>
>> for (i = 0; i < buf->nbufs; ++i) {
>> ptr = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
>> get_order(page_size));
>> if (!ptr) {
>> dev_err(dev, "Alloc pages error.\n");
>> goto err_free;
>> }
>>
>> t = dma_map_single(dev, ptr, page_size,
>> DMA_BIDIRECTIONAL);
>> if (dma_mapping_error(dev, t)) {
>> dev_err(dev, "DMA mapping error.\n");
>> free_pages((unsigned long)ptr,
>> get_order(page_size));
>> goto err_free;
>> }
>>
>> buf->page_list[i].buf = ptr;
>> buf->page_list[i].map = t;
>> }
>>
>> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
>> GFP_KERNEL);
>> if (!pages)
>> goto err_free;
>>
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] = virt_to_page(buf->page_list[i].buf);
>>
>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>> PAGE_KERNEL);
>> kfree(pages);
>> if (!buf->direct.buf)
>> goto err_free;
>>
>>
>> Regards
>> Wei Hu
>>>> Regards
>>>> Wei Hu
>>>>>> } else {
>>>>>> kfree(d_page);
>>>>>> d_page = NULL;
>>>>>> }
>>>>>> return d_page;
>>>>>> }
>>>>>>
>>>>>> Regards
>>>>>> Wei Hu
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>>>>>>> +++++++++++++++++++++++++++---
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> index 3e4c525..a69cd4b 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>>>>>>> *hr_dev, u32 size, u32 max_direct,
>>>>>>>> goto err_free;
>>>>>>>>
>>>>>>>> for (i = 0; i < buf->nbufs; ++i)
>>>>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>>>>>>> + pages[i] =
>>>>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>>>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>>>>>>> + virt_to_page(buf->page_list[i].buf);
>>>>>>>>
>>>>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>>>>>> PAGE_KERNEL);
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> index 8388ae2..4a3d1d4 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> gfp_t gfp_mask)
>>>>>>>> {
>>>>>>>> struct hns_roce_hem_chunk *chunk = NULL;
>>>>>>>> + struct hns_roce_vmalloc *vmalloc;
>>>>>>>> struct hns_roce_hem *hem;
>>>>>>>> struct scatterlist *mem;
>>>>>>>> int order;
>>>>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>>>>>> chunk->npages = 0;
>>>>>>>> chunk->nsg = 0;
>>>>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>>>>>>> }
>>>>>>>>
>>>>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> if (!buf)
>>>>>>>> goto fail;
>>>>>>>>
>>>>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>>>> + if (is_vmalloc_addr(buf)) {
>>>>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>>>>>>> + vmalloc->is_vmalloc_addr = true;
>>>>>>>> + vmalloc->vmalloc_addr = buf;
>>>>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>>>>>>> + PAGE_SIZE << order, offset_in_page(buf));
>>>>>>>> + } else {
>>>>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>>>> + }
>>>>>>>> WARN_ON(mem->offset);
>>>>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>>>>>>
>>>>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>>>>>>> hns_roce_hem *hem)
>>>>>>>> {
>>>>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>>>>>>> + void *cpu_addr;
>>>>>>>> int i;
>>>>>>>>
>>>>>>>> if (!hem)
>>>>>>>> return;
>>>>>>>>
>>>>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>>>>>> - for (i = 0; i < chunk->npages; ++i)
>>>>>>>> + for (i = 0; i < chunk->npages; ++i) {
>>>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>>>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>>>>>> + else
>>>>>>>> + cpu_addr =
>>>>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>>>>>>> +
>>>>>>>> dma_free_coherent(hr_dev->dev,
>>>>>>>> chunk->mem[i].length,
>>>>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>>>>>> + cpu_addr,
>>>>>>>> sg_dma_address(&chunk->mem[i]));
>>>>>>>> + }
>>>>>>>> kfree(chunk);
>>>>>>>> }
>>>>>>>>
>>>>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>>>>>> *hr_dev,
>>>>>>>>
>>>>>>>> if (chunk->mem[i].length > (u32)offset) {
>>>>>>>> page = sg_page(&chunk->mem[i]);
>>>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>>>>>> + mutex_unlock(&table->mutex);
>>>>>>>> + return page ?
>>>>>>>> + chunk->vmalloc[i].vmalloc_addr
>>>>>>>> + + offset : NULL;
>>>>>>>> + }
>>>>>>>> goto out;
>>>>>>>> }
>>>>>>>> offset -= chunk->mem[i].length;
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> index af28bbf..62d712a 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> @@ -72,11 +72,17 @@ enum {
>>>>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>>>>>> };
>>>>>>>>
>>>>>>>> +struct hns_roce_vmalloc {
>>>>>>>> + bool is_vmalloc_addr;
>>>>>>>> + void *vmalloc_addr;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> struct hns_roce_hem_chunk {
>>>>>>>> struct list_head list;
>>>>>>>> int npages;
>>>>>>>> int nsg;
>>>>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>>>> };
>>>>>>>>
>>>>>>>> struct hns_roce_hem {
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> index b99d70a..9e19bf1 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>>>> {
>>>>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>>>>>> struct scatterlist *sg;
>>>>>>>> + u64 page_addr = 0;
>>>>>>>> u64 *pages;
>>>>>>>> + int i = 0, j = 0;
>>>>>>>> + int len = 0;
>>>>>>>> int entry;
>>>>>>>> - int i;
>>>>>>>>
>>>>>>>> mpt_entry = mb_buf;
>>>>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>>>>
>>>>>>>> i = 0;
>>>>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>>>>>> -
>>>>>>>> - /* Record the first 2 entry directly to MTPT table */
>>>>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>>>> - break;
>>>>>>>> - i++;
>>>>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>>>>>> + for (j = 0; j < len; ++j) {
>>>>>>>> + page_addr = sg_dma_address(sg) +
>>>>>>>> + (j << mr->umem->page_shift);
>>>>>>>> + pages[i] = page_addr >> 6;
>>>>>>>> +
>>>>>>>> + /* Record the first 2 entry directly to MTPT table */
>>>>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>>>> + goto found;
>>>>>>>> + i++;
>>>>>>>> + }
>>>>>>>> }
>>>>>>>>
>>>>>>>> +found:
>>>>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>>>>>> --
>>>>>>>> 1.9.1
>>>>>>>>
>>>>>> _______________________________________________
>>>>>> iommu mailing list
>>>>>> iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
>>>>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>>>> .
>>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>> .
>>>
>>
^ permalink raw reply [flat|nested] 57+ messages in thread
>>>>>>>> + i++;
>>>>>>>> + }
>>>>>>>> }
>>>>>>>>
>>>>>>>> +found:
>>>>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>>>>>> --
>>>>>>>> 1.9.1
>>>>>>>>
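The hns_roce_table_find hunk quoted above walks the chunk's per-entry lengths to locate which sg entry contains a byte offset before choosing between the vmalloc and lowmem paths. A minimal userspace model of that offset walk (the function and type names here are illustrative stand-ins, not the driver's):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the offset walk in hns_roce_table_find: scan per-entry
 * lengths until the entry containing 'offset' is found, leaving the
 * residual offset within that entry. */
static int find_chunk_entry(const uint32_t *lengths, int n, uint32_t offset,
                            uint32_t *residual)
{
    for (int i = 0; i < n; i++) {
        if (lengths[i] > offset) {
            *residual = offset; /* final address = entry i's base + residual */
            return i;
        }
        offset -= lengths[i];
    }
    return -1; /* offset lies beyond all entries in this chunk */
}
```

In the real driver the located entry then yields either `chunk->vmalloc[i].vmalloc_addr + offset` or a page address, as the hunk shows.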
>>>>>> _______________________________________________
>>>>>> iommu mailing list
>>>>>> iommu@lists.linux-foundation.org
>>>>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>>>> .
>>>>>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
@ 2017-11-09 1:17 ` Wei Hu (Xavier)
0 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-11-09 1:17 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Robin Murphy, shaobo.xu, xavier.huwei, lijun_nudt, oulijun,
linux-rdma, charles.chenxin, linuxarm, iommu, linux-kernel,
linux-mm, dledford, liuyixian, zhangxiping3, shaoboxu
On 2017/11/7 14:32, Leon Romanovsky wrote:
> On Tue, Nov 07, 2017 at 10:45:29AM +0800, Wei Hu (Xavier) wrote:
>>
>> On 2017/11/1 20:26, Robin Murphy wrote:
>>> On 01/11/17 07:46, Wei Hu (Xavier) wrote:
>>>> On 2017/10/12 20:59, Robin Murphy wrote:
>>>>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>>>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>>>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>>>>>> If the IOMMU is enabled, the length of the sg obtained from
>>>>>>>> __iommu_map_sg_attrs is not 4KB. When the IOVA is set with the sg
>>>>>>>> dma address, the IOVA will not be page-contiguous, and the VA
>>>>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>>>>>> the VA obtained by page_address is a discontiguous VA. Under
>>>>>>>> these circumstances, the IOVA should be calculated based on the
>>>>>>>> sg length, and the VA returned from dma_alloc_coherent should be
>>>>>>>> recorded in the hem struct.
>>>>>>>>
>>>>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>>>>>>>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>>>>>>>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>>>>>>>> ---
>>>>>>> Doug,
>>>>>>>
>>>>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>>>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>>>>>
>>>>>>> Thanks
>>>>>> Hi, Leon & Doug
>>>>>> We referred to the function named __ttm_dma_alloc_page in the
>>>>>> kernel code, as below. There are similar methods in the
>>>>>> bch_bio_map and mem_to_page functions in the current 4.14-rcX.
>>>>>>
>>>>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>>>>>> {
>>>>>> struct dma_page *d_page;
>>>>>>
>>>>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>>>>>> if (!d_page)
>>>>>> return NULL;
>>>>>>
>>>>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>>>>>> &d_page->dma,
>>>>>> pool->gfp_flags);
>>>>>> if (d_page->vaddr) {
>>>>>> if (is_vmalloc_addr(d_page->vaddr))
>>>>>> d_page->p = vmalloc_to_page(d_page->vaddr);
>>>>>> else
>>>>>> d_page->p = virt_to_page(d_page->vaddr);
>>>>> There are cases on various architectures where neither of those is
>>>>> right. Whether those actually intersect with TTM or RDMA use-cases is
>>>>> another matter, of course.
>>>>>
>>>>> What definitely is a problem is if you ever take that page and end up
>>>>> accessing it through any virtual address other than the one explicitly
>>>>> returned by dma_alloc_coherent(). That can blow the coherency wide open
>>>>> and invite data loss, right up to killing the whole system with a
>>>>> machine check on certain architectures.
>>>>>
>>>>> Robin.
>>>> Hi, Robin
>>>> Thanks for your comment.
>>>>
>>>> We have one problem, and the related code is below.
>>>> 1. Call the dma_alloc_coherent function several times to allocate memory.
>>>> 2. vmap the allocated memory pages.
>>>> 3. Software accesses the memory through the virtual address returned
>>>> by vmap, while the hardware uses the dma address from dma_alloc_coherent.
>>> The simple answer is "don't do that". Seriously. dma_alloc_coherent()
>>> gives you a CPU virtual address and a DMA address with which to access
>>> your buffer, and that is the limit of what you may infer about it. You
>>> have no guarantee that the virtual address is either in the linear map
>>> or vmalloc, and not some other special place. You have no guarantee that
>>> the underlying memory even has an associated struct page at all.
>>>
>>>> When the IOMMU is disabled on the ARM64 architecture, using
>>>> virt_to_page() before vmap() works. When the IOMMU is enabled,
>>>> virt_to_page() causes a call trace later; we found that the address
>>>> returned by dma_alloc_coherent is a vmalloc address, so we added the
>>>> conditional check below, and it works.
>>>> for (i = 0; i < buf->nbufs; ++i)
>>>> pages[i] =
>>>> is_vmalloc_addr(buf->page_list[i].buf) ?
>>>> vmalloc_to_page(buf->page_list[i].buf) :
>>>> virt_to_page(buf->page_list[i].buf);
>>>> Can you give us a suggestion, or a better method?
>>> Oh my goodness, having now taken a closer look at this driver, I'm lost
>>> for words in disbelief. To pick just one example:
>>>
>>> u32 bits_per_long = BITS_PER_LONG;
>>> ...
>>> if (bits_per_long == 64) {
>>> /* memory mapping nonsense */
>>> }
>>>
>>> WTF does the size of a long have to do with DMA buffer management!?
>>>
>>> Of course I can guess that it might be trying to make some tortuous
>>> inference about vmalloc space being constrained on 32-bit platforms, but
>>> still...
>>>
>>>> The related code as below:
>>>> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>>>> GFP_KERNEL);
>>>> if (!buf->page_list)
>>>> return -ENOMEM;
>>>>
>>>> for (i = 0; i < buf->nbufs; ++i) {
>>>> buf->page_list[i].buf = dma_alloc_coherent(dev,
>>>> page_size, &t,
>>>> GFP_KERNEL);
>>>> if (!buf->page_list[i].buf)
>>>> goto err_free;
>>>>
>>>> buf->page_list[i].map = t;
>>>> memset(buf->page_list[i].buf, 0, page_size);
>>>> }
>>>>
>>>> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
>>>> GFP_KERNEL);
>>>> if (!pages)
>>>> goto err_free;
>>>>
>>>> for (i = 0; i < buf->nbufs; ++i)
>>>> pages[i] =
>>>> is_vmalloc_addr(buf->page_list[i].buf) ?
>>>> vmalloc_to_page(buf->page_list[i].buf) :
>>>> virt_to_page(buf->page_list[i].buf);
>>>>
>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>> PAGE_KERNEL);
>>>> kfree(pages);
>>>> if (!buf->direct.buf)
>>>> goto err_free;
>>> OK, this is complete crap. As above, you cannot assume that a struct
>>> page even exists; even if it does you cannot assume that using a
>>> PAGE_KERNEL mapping will not result in mismatched attributes,
>>> unpredictable behaviour and data loss. Trying to remap coherent DMA
>>> allocations like this is just egregiously wrong.
>>>
>>> What I do like is that you can seemingly fix all this by simply deleting
>>> hns_roce_buf::direct and all the garbage code related to it, and using
>>> the page_list entries consistently because the alternate paths involving
>>> those appear to do the right thing already.
>>>
>>> That is, of course, assuming that the buffers involved can be so large
>>> that it's not practical to just always make a single allocation and
>>> fragment it into multiple descriptors if the hardware does have some
>>> maximum length constraint - frankly I'm a little puzzled by the
>>> PAGE_SIZE * 2 threshold, given that that's not a fixed size.
>>>
>>> Robin.
>> Hi, Robin
>>
>> We reconstructed the code as below:
>> It replaces dma_alloc_coherent with the __get_free_pages and
>> dma_map_single functions. So, we can vmap several pointers returned by
>> __get_free_pages, right?
> Most probably not, you should get rid of your virt_to_page/vmap calls.
>
> Thanks
Hi, Leon
Thanks for your suggestion.
I will send a patch to fix it.
Regards
Wei Hu
>>
>> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>> GFP_KERNEL);
>> if (!buf->page_list)
>> return -ENOMEM;
>>
>> for (i = 0; i < buf->nbufs; ++i) {
>> ptr = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
>> get_order(page_size));
>> if (!ptr) {
>> dev_err(dev, "Alloc pages error.\n");
>> goto err_free;
>> }
>>
>> t = dma_map_single(dev, ptr, page_size,
>> DMA_BIDIRECTIONAL);
>> if (dma_mapping_error(dev, t)) {
>> dev_err(dev, "DMA mapping error.\n");
>> free_pages((unsigned long)ptr,
>> get_order(page_size));
>> goto err_free;
>> }
>>
>> buf->page_list[i].buf = ptr;
>> buf->page_list[i].map = t;
>> }
>>
>> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
>> GFP_KERNEL);
>> if (!pages)
>> goto err_free;
>>
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] = virt_to_page(buf->page_list[i].buf);
>>
>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>> PAGE_KERNEL);
>> kfree(pages);
>> if (!buf->direct.buf)
>> goto err_free;
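The reworked loop above allocates page_size chunks one at a time and jumps to err_free on any failure. A userspace model of the same allocate-or-unwind shape (plain calloc stands in for __get_free_pages plus dma_map_single; the struct and function names are illustrative, not the driver's):

```c
#include <assert.h>
#include <stdlib.h>

struct buf_page {
    void *buf; /* CPU address; the driver also keeps the dma address */
};

/* Allocate 'nbufs' zeroed chunks of 'page_size' bytes. On any failure,
 * free everything already allocated, mirroring the err_free unwinding
 * in the snippet above. */
static int alloc_page_list(struct buf_page *list, int nbufs, size_t page_size)
{
    for (int i = 0; i < nbufs; i++) {
        list[i].buf = calloc(1, page_size);
        if (!list[i].buf) {
            while (i-- > 0) {
                free(list[i].buf);
                list[i].buf = NULL;
            }
            return -1;
        }
    }
    return 0;
}
```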
>>
>>
>> Regards
>> Wei Hu
>>>> Regards
>>>> Wei Hu
>>>>>> } else {
>>>>>> kfree(d_page);
>>>>>> d_page = NULL;
>>>>>> }
>>>>>> return d_page;
>>>>>> }
>>>>>>
>>>>>> Regards
>>>>>> Wei Hu
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>>>>>>> +++++++++++++++++++++++++++---
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> index 3e4c525..a69cd4b 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>>>>>>> *hr_dev, u32 size, u32 max_direct,
>>>>>>>> goto err_free;
>>>>>>>>
>>>>>>>> for (i = 0; i < buf->nbufs; ++i)
>>>>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>>>>>>> + pages[i] =
>>>>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>>>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>>>>>>> + virt_to_page(buf->page_list[i].buf);
>>>>>>>>
>>>>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>>>>>> PAGE_KERNEL);
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> index 8388ae2..4a3d1d4 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> gfp_t gfp_mask)
>>>>>>>> {
>>>>>>>> struct hns_roce_hem_chunk *chunk = NULL;
>>>>>>>> + struct hns_roce_vmalloc *vmalloc;
>>>>>>>> struct hns_roce_hem *hem;
>>>>>>>> struct scatterlist *mem;
>>>>>>>> int order;
>>>>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>>>>>> chunk->npages = 0;
>>>>>>>> chunk->nsg = 0;
>>>>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>>>>>>> }
>>>>>>>>
>>>>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> if (!buf)
>>>>>>>> goto fail;
>>>>>>>>
>>>>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>>>> + if (is_vmalloc_addr(buf)) {
>>>>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>>>>>>> + vmalloc->is_vmalloc_addr = true;
>>>>>>>> + vmalloc->vmalloc_addr = buf;
>>>>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>>>>>>> + PAGE_SIZE << order, offset_in_page(buf));
>>>>>>>> + } else {
>>>>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>>>> + }
>>>>>>>> WARN_ON(mem->offset);
>>>>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>>>>>>
>>>>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>>>>>>> hns_roce_hem *hem)
>>>>>>>> {
>>>>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>>>>>>> + void *cpu_addr;
>>>>>>>> int i;
>>>>>>>>
>>>>>>>> if (!hem)
>>>>>>>> return;
>>>>>>>>
>>>>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>>>>>> - for (i = 0; i < chunk->npages; ++i)
>>>>>>>> + for (i = 0; i < chunk->npages; ++i) {
>>>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>>>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>>>>>> + else
>>>>>>>> + cpu_addr =
>>>>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>>>>>>> +
>>>>>>>> dma_free_coherent(hr_dev->dev,
>>>>>>>> chunk->mem[i].length,
>>>>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>>>>>> + cpu_addr,
>>>>>>>> sg_dma_address(&chunk->mem[i]));
>>>>>>>> + }
>>>>>>>> kfree(chunk);
>>>>>>>> }
>>>>>>>>
>>>>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>>>>>> *hr_dev,
>>>>>>>>
>>>>>>>> if (chunk->mem[i].length > (u32)offset) {
>>>>>>>> page = sg_page(&chunk->mem[i]);
>>>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>>>>>> + mutex_unlock(&table->mutex);
>>>>>>>> + return page ?
>>>>>>>> + chunk->vmalloc[i].vmalloc_addr
>>>>>>>> + + offset : NULL;
>>>>>>>> + }
>>>>>>>> goto out;
>>>>>>>> }
>>>>>>>> offset -= chunk->mem[i].length;
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> index af28bbf..62d712a 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> @@ -72,11 +72,17 @@ enum {
>>>>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>>>>>> };
>>>>>>>>
>>>>>>>> +struct hns_roce_vmalloc {
>>>>>>>> + bool is_vmalloc_addr;
>>>>>>>> + void *vmalloc_addr;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> struct hns_roce_hem_chunk {
>>>>>>>> struct list_head list;
>>>>>>>> int npages;
>>>>>>>> int nsg;
>>>>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>>>> };
>>>>>>>>
>>>>>>>> struct hns_roce_hem {
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> index b99d70a..9e19bf1 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>>>> {
>>>>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>>>>>> struct scatterlist *sg;
>>>>>>>> + u64 page_addr = 0;
>>>>>>>> u64 *pages;
>>>>>>>> + int i = 0, j = 0;
>>>>>>>> + int len = 0;
>>>>>>>> int entry;
>>>>>>>> - int i;
>>>>>>>>
>>>>>>>> mpt_entry = mb_buf;
>>>>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>>>>
>>>>>>>> i = 0;
>>>>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>>>>>> -
>>>>>>>> - /* Record the first 2 entry directly to MTPT table */
>>>>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>>>> - break;
>>>>>>>> - i++;
>>>>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>>>>>> + for (j = 0; j < len; ++j) {
>>>>>>>> + page_addr = sg_dma_address(sg) +
>>>>>>>> + (j << mr->umem->page_shift);
>>>>>>>> + pages[i] = page_addr >> 6;
>>>>>>>> +
>>>>>>>> + /* Record the first 2 entry directly to MTPT table */
>>>>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>>>> + goto found;
>>>>>>>> + i++;
>>>>>>>> + }
>>>>>>>> }
>>>>>>>>
>>>>>>>> +found:
>>>>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>>>>>> --
>>>>>>>> 1.9.1
>>>>>>>>
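The new loop in hns_roce_v2_write_mtpt above expands each DMA segment into page-sized entries (each address shifted right by 6) and stops once the inline MTPT slots are full. That logic can be modeled in plain userspace C; the struct below is a simplified stand-in for a mapped scatterlist entry, and HNS_ROCE_V2_MAX_INNER_MTPT_NUM is assumed to be 2 here, since the actual value is not shown in this thread:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INNER_MTPT_NUM 2 /* assumed stand-in for HNS_ROCE_V2_MAX_INNER_MTPT_NUM */
#define MODEL_PAGE_SHIFT   12

struct seg { /* simplified stand-in for one mapped scatterlist entry */
    uint64_t dma_addr;
    uint32_t dma_len;
};

/* Expand DMA segments into per-page MTPT entries (addr >> 6), recording
 * at most MAX_INNER_MTPT_NUM entries, as the patched loop does. Returns
 * the number of entries recorded. */
static int fill_mtpt_pages(const struct seg *segs, int nseg, uint64_t *pages)
{
    int i = 0;

    for (int e = 0; e < nseg; e++) {
        int len = segs[e].dma_len >> MODEL_PAGE_SHIFT;
        for (int j = 0; j < len; j++) {
            uint64_t page_addr = segs[e].dma_addr +
                                 ((uint64_t)j << MODEL_PAGE_SHIFT);
            pages[i] = page_addr >> 6; /* hardware format: address >> 6 */
            if (i >= MAX_INNER_MTPT_NUM - 1)
                return i + 1;          /* the 'goto found' in the patch */
            i++;
        }
    }
    return i;
}
```

With one 16KB segment, the first two 4KB pages are recorded and the walk stops, matching the "record the first 2 entries directly to the MTPT table" comment.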
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-11-07 15:58 ` Christoph Hellwig
@ 2017-11-09 1:26 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-11-09 1:26 UTC (permalink / raw)
To: Christoph Hellwig, Jason Gunthorpe
Cc: Robin Murphy, Leon Romanovsky, shaobo.xu-ral2JQCrhuEAvxtiuMwx3w,
xavier.huwei-WVlzvzqoTvw, lijun_nudt-9Onoh4P/yGk,
oulijun-hv44wF8Li93QT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, dledford-H+wXaHxf7aLQT0dZR+AlfA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, shaoboxu-WVlzvzqoTvw
On 2017/11/7 23:58, Christoph Hellwig wrote:
> On Tue, Nov 07, 2017 at 08:48:38AM -0700, Jason Gunthorpe wrote:
>> Can't you just use vmalloc and dma_map that? Other drivers follow that
>> approach..
> You can't easily due to the flushing requirements. We used to do that
> in XFS and it led to problems. You need the page allocator + vmap +
> invalidate_kernel_vmap_range + flush_kernel_vmap_range to get the
> cache flushing right.
>
> .
Hi, Christoph Hellwig
Thanks for your suggestion.
Regards
Wei Hu
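Christoph's recipe quoted above (page allocator + vmap, with flush/invalidate of the vmap range around device access) implies a strict ordering of cache maintenance. A userspace sketch that models only that ordering; the three functions below are stubs standing in for the kernel APIs he names, not real implementations, and they simply record the call sequence so it can be checked:

```c
#include <assert.h>

static int step; /* records which phase of the sequence we are in */

/* Stub: make CPU writes through the vmap alias visible to the device. */
static void flush_kernel_vmap_range(void *addr, int size)
{
    (void)addr; (void)size;
    assert(step == 0);
    step = 1;
}

/* Stub: the device's DMA to/from the buffer happens here. */
static void device_dma_access(void)
{
    assert(step == 1);
    step = 2;
}

/* Stub: drop stale CPU cache lines before reading what the device wrote. */
static void invalidate_kernel_vmap_range(void *addr, int size)
{
    (void)addr; (void)size;
    assert(step == 2);
    step = 3;
}

/* The required order: flush before DMA, invalidate after DMA, with all
 * CPU access going through the single vmap alias. */
static void dma_roundtrip(void *vaddr, int size)
{
    flush_kernel_vmap_range(vaddr, size);
    device_dma_access();
    invalidate_kernel_vmap_range(vaddr, size);
}
```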
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-11-07 15:48 ` Jason Gunthorpe
@ 2017-11-09 1:30 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-11-09 1:30 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Robin Murphy, Leon Romanovsky, shaobo.xu-ral2JQCrhuEAvxtiuMwx3w,
xavier.huwei-WVlzvzqoTvw, lijun_nudt-9Onoh4P/yGk,
oulijun-hv44wF8Li93QT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
charles.chenxin-hv44wF8Li93QT0dZR+AlfA,
linuxarm-hv44wF8Li93QT0dZR+AlfA,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, dledford-H+wXaHxf7aLQT0dZR+AlfA,
liuyixian-hv44wF8Li93QT0dZR+AlfA,
zhangxiping3-hv44wF8Li93QT0dZR+AlfA, shaoboxu-WVlzvzqoTvw
On 2017/11/7 23:48, Jason Gunthorpe wrote:
> On Tue, Nov 07, 2017 at 10:45:29AM +0800, Wei Hu (Xavier) wrote:
>
>> We reconstruct the code as below:
>> It replaces dma_alloc_coherent with __get_free_pages and
>> dma_map_single functions. So, we can vmap several pointers returned by
>> __get_free_pages, right?
> Can't you just use vmalloc and dma_map that? Other drivers follow that
> approach..
>
> However, dma_alloc_coherent and dma_map_single are not the same
> thing. You can't touch the vmap memory once you call dma_map unless
> the driver also includes dma cache flushing calls in all the right
> places.
>
> The difference is that alloc_coherent will return non-cacheable memory
> if necessary, while get_free_pages does not.
>
> Jason
Hi, Jason
Thanks for your suggestion.
We will fix it.
Regards
Wei Hu
>
> .
>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
2017-11-01 12:26 ` Robin Murphy
@ 2017-11-09 1:36 ` Wei Hu (Xavier)
-1 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-11-09 1:36 UTC (permalink / raw)
To: Robin Murphy
Cc: Leon Romanovsky, shaobo.xu, xavier.huwei, lijun_nudt, oulijun,
linux-rdma, charles.chenxin, linuxarm, iommu, linux-kernel,
linux-mm, dledford, liuyixian, zhangxiping3, shaoboxu
On 2017/11/1 20:26, Robin Murphy wrote:
> On 01/11/17 07:46, Wei Hu (Xavier) wrote:
>>
>> On 2017/10/12 20:59, Robin Murphy wrote:
>>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>>>> If the IOMMU is enabled, the length of the sg obtained from
>>>>>> __iommu_map_sg_attrs is not 4KB. When the IOVA is set with the sg
>>>>>> dma address, the IOVA will not be page-contiguous, and the VA
>>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>>>> the VA obtained by page_address is a discontiguous VA. Under
>>>>>> these circumstances, the IOVA should be calculated based on the
>>>>>> sg length, and the VA returned from dma_alloc_coherent should be
>>>>>> recorded in the hem struct.
>>>>>>
>>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>>>>>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>>>>>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>>>>>> ---
>>>>> Doug,
>>>>>
>>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>>>
>>>>> Thanks
>>>> Hi, Leon & Doug
>>>> We referred to the function named __ttm_dma_alloc_page in the
>>>> kernel code, as below. There are similar methods in the
>>>> bch_bio_map and mem_to_page functions in the current 4.14-rcX.
>>>>
>>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>>>> {
>>>> struct dma_page *d_page;
>>>>
>>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>>>> if (!d_page)
>>>> return NULL;
>>>>
>>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>>>> &d_page->dma,
>>>> pool->gfp_flags);
>>>> if (d_page->vaddr) {
>>>> if (is_vmalloc_addr(d_page->vaddr))
>>>> d_page->p = vmalloc_to_page(d_page->vaddr);
>>>> else
>>>> d_page->p = virt_to_page(d_page->vaddr);
>>> There are cases on various architectures where neither of those is
>>> right. Whether those actually intersect with TTM or RDMA use-cases is
>>> another matter, of course.
>>>
>>> What definitely is a problem is if you ever take that page and end up
>>> accessing it through any virtual address other than the one explicitly
>>> returned by dma_alloc_coherent(). That can blow the coherency wide open
>>> and invite data loss, right up to killing the whole system with a
>>> machine check on certain architectures.
>>>
>>> Robin.
>> Hi, Robin
>> Thanks for your comment.
>>
>> We have one problem, and the related code is below.
>> 1. Call the dma_alloc_coherent function several times to allocate memory.
>> 2. vmap the allocated memory pages.
>> 3. Software accesses the memory through the virtual address returned
>> by vmap, while the hardware uses the dma address from dma_alloc_coherent.
> The simple answer is "don't do that". Seriously. dma_alloc_coherent()
> gives you a CPU virtual address and a DMA address with which to access
> your buffer, and that is the limit of what you may infer about it. You
> have no guarantee that the virtual address is either in the linear map
> or vmalloc, and not some other special place. You have no guarantee that
> the underlying memory even has an associated struct page at all.
>
>> When the IOMMU is disabled on the ARM64 architecture, using
>> virt_to_page() before vmap() works. When the IOMMU is enabled,
>> virt_to_page() causes a call trace later; we found that the address
>> returned by dma_alloc_coherent is a vmalloc address, so we added the
>> conditional check below, and it works.
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] =
>> is_vmalloc_addr(buf->page_list[i].buf) ?
>> vmalloc_to_page(buf->page_list[i].buf) :
>> virt_to_page(buf->page_list[i].buf);
>> Can you give us a suggestion, or a better method?
> Oh my goodness, having now taken a closer look at this driver, I'm lost
> for words in disbelief. To pick just one example:
>
> u32 bits_per_long = BITS_PER_LONG;
> ...
> if (bits_per_long == 64) {
> /* memory mapping nonsense */
> }
>
> WTF does the size of a long have to do with DMA buffer management!?
>
> Of course I can guess that it might be trying to make some tortuous
> inference about vmalloc space being constrained on 32-bit platforms, but
> still...
We will fix it. Thanks
>> The related code as below:
>> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>> GFP_KERNEL);
>> if (!buf->page_list)
>> return -ENOMEM;
>>
>> for (i = 0; i < buf->nbufs; ++i) {
>> buf->page_list[i].buf = dma_alloc_coherent(dev,
>> page_size, &t,
>> GFP_KERNEL);
>> if (!buf->page_list[i].buf)
>> goto err_free;
>>
>> buf->page_list[i].map = t;
>> memset(buf->page_list[i].buf, 0, page_size);
>> }
>>
>> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
>> GFP_KERNEL);
>> if (!pages)
>> goto err_free;
>>
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] =
>> is_vmalloc_addr(buf->page_list[i].buf) ?
>> vmalloc_to_page(buf->page_list[i].buf) :
>> virt_to_page(buf->page_list[i].buf);
>>
>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>> PAGE_KERNEL);
>> kfree(pages);
>> if (!buf->direct.buf)
>> goto err_free;
> OK, this is complete crap. As above, you cannot assume that a struct
> page even exists; even if it does you cannot assume that using a
> PAGE_KERNEL mapping will not result in mismatched attributes,
> unpredictable behaviour and data loss. Trying to remap coherent DMA
> allocations like this is just egregiously wrong.
>
> What I do like is that you can seemingly fix all this by simply deleting
> hns_roce_buf::direct and all the garbage code related to it, and using
> the page_list entries consistently because the alternate paths involving
> those appear to do the right thing already.
Hi, Robin
Thanks for your suggestion.
We will fix it.
Regards
Wei Hu
> That is, of course, assuming that the buffers involved can be so large
> that it's not practical to just always make a single allocation and
> fragment it into multiple descriptors if the hardware does have some
> maximum length constraint - frankly I'm a little puzzled by the
> PAGE_SIZE * 2 threshold, given that that's not a fixed size.
>
> Robin.
>
>> Regards
>> Wei Hu
>>>> } else {
>>>> kfree(d_page);
>>>> d_page = NULL;
>>>> }
>>>> return d_page;
>>>> }
>>>>
>>>> Regards
>>>> Wei Hu
>>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>>>>> +++++++++++++++++++++++++++---
>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> index 3e4c525..a69cd4b 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>>>>> *hr_dev, u32 size, u32 max_direct,
>>>>>> goto err_free;
>>>>>>
>>>>>> for (i = 0; i < buf->nbufs; ++i)
>>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>>>>> + pages[i] =
>>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>>>>> + virt_to_page(buf->page_list[i].buf);
>>>>>>
>>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>>>> PAGE_KERNEL);
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> index 8388ae2..4a3d1d4 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> gfp_t gfp_mask)
>>>>>> {
>>>>>> struct hns_roce_hem_chunk *chunk = NULL;
>>>>>> + struct hns_roce_vmalloc *vmalloc;
>>>>>> struct hns_roce_hem *hem;
>>>>>> struct scatterlist *mem;
>>>>>> int order;
>>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>>>> chunk->npages = 0;
>>>>>> chunk->nsg = 0;
>>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>>>>> }
>>>>>>
>>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> if (!buf)
>>>>>> goto fail;
>>>>>>
>>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> + if (is_vmalloc_addr(buf)) {
>>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>>>>> + vmalloc->is_vmalloc_addr = true;
>>>>>> + vmalloc->vmalloc_addr = buf;
>>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>>>>> + PAGE_SIZE << order, offset_in_page(buf));
>>>>>> + } else {
>>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> + }
>>>>>> WARN_ON(mem->offset);
>>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>>>>
>>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>>>>> hns_roce_hem *hem)
>>>>>> {
>>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>>>>> + void *cpu_addr;
>>>>>> int i;
>>>>>>
>>>>>> if (!hem)
>>>>>> return;
>>>>>>
>>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>>>> - for (i = 0; i < chunk->npages; ++i)
>>>>>> + for (i = 0; i < chunk->npages; ++i) {
>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>>>> + else
>>>>>> + cpu_addr =
>>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>>>>> +
>>>>>> dma_free_coherent(hr_dev->dev,
>>>>>> chunk->mem[i].length,
>>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>>>> + cpu_addr,
>>>>>> sg_dma_address(&chunk->mem[i]));
>>>>>> + }
>>>>>> kfree(chunk);
>>>>>> }
>>>>>>
>>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>>>> *hr_dev,
>>>>>>
>>>>>> if (chunk->mem[i].length > (u32)offset) {
>>>>>> page = sg_page(&chunk->mem[i]);
>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>>>> + mutex_unlock(&table->mutex);
>>>>>> + return page ?
>>>>>> + chunk->vmalloc[i].vmalloc_addr
>>>>>> + + offset : NULL;
>>>>>> + }
>>>>>> goto out;
>>>>>> }
>>>>>> offset -= chunk->mem[i].length;
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> index af28bbf..62d712a 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> @@ -72,11 +72,17 @@ enum {
>>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>>>> };
>>>>>>
>>>>>> +struct hns_roce_vmalloc {
>>>>>> + bool is_vmalloc_addr;
>>>>>> + void *vmalloc_addr;
>>>>>> +};
>>>>>> +
>>>>>> struct hns_roce_hem_chunk {
>>>>>> struct list_head list;
>>>>>> int npages;
>>>>>> int nsg;
>>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>> };
>>>>>>
>>>>>> struct hns_roce_hem {
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> index b99d70a..9e19bf1 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>> {
>>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>>>> struct scatterlist *sg;
>>>>>> + u64 page_addr = 0;
>>>>>> u64 *pages;
>>>>>> + int i = 0, j = 0;
>>>>>> + int len = 0;
>>>>>> int entry;
>>>>>> - int i;
>>>>>>
>>>>>> mpt_entry = mb_buf;
>>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>>
>>>>>> i = 0;
>>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>>>> -
>>>>>> - /* Record the first 2 entry directly to MTPT table */
>>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> - break;
>>>>>> - i++;
>>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>>>> + for (j = 0; j < len; ++j) {
>>>>>> + page_addr = sg_dma_address(sg) +
>>>>>> + (j << mr->umem->page_shift);
>>>>>> + pages[i] = page_addr >> 6;
>>>>>> +
>>>>>> + /* Record the first 2 entry directly to MTPT table */
>>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> + goto found;
>>>>>> + i++;
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> +found:
>>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>>> _______________________________________________
>>>> iommu mailing list
>>>> iommu@lists.linux-foundation.org
>>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>> .
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> .
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08
@ 2017-11-09 1:36 ` Wei Hu (Xavier)
0 siblings, 0 replies; 57+ messages in thread
From: Wei Hu (Xavier) @ 2017-11-09 1:36 UTC (permalink / raw)
To: Robin Murphy
Cc: Leon Romanovsky, shaobo.xu, xavier.huwei, lijun_nudt, oulijun,
linux-rdma, charles.chenxin, linuxarm, iommu, linux-kernel,
linux-mm, dledford, liuyixian, zhangxiping3, shaoboxu
On 2017/11/1 20:26, Robin Murphy wrote:
> On 01/11/17 07:46, Wei Hu (Xavier) wrote:
>>
>> On 2017/10/12 20:59, Robin Murphy wrote:
>>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>>>> If the IOMMU is enabled, the length of the sg obtained from
>>>>>> __iommu_map_sg_attrs is not 4 KB. When the IOVA is set with the sg
>>>>>> dma address, the IOVA will not be page-contiguous, and the VA
>>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>>>> the VA obtained by page_address is a discontiguous VA. Under
>>>>>> these circumstances, the IOVA should be calculated based on the
>>>>>> sg length, and the VA returned from dma_alloc_coherent should be
>>>>>> recorded in the hem struct.
>>>>>>
>>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
>>>>>> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
>>>>>> Signed-off-by: Lijun Ou <oulijun@huawei.com>
>>>>>> ---
>>>>> Doug,
>>>>>
>>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>>>
>>>>> Thanks
>>>> Hi, Leon & Doug
>>>> We referred to the function __ttm_dma_alloc_page in the kernel
>>>> code, shown below.
>>>> There are similar methods in the bch_bio_map and mem_to_page
>>>> functions in current 4.14-rcx.
>>>>
>>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>>>> {
>>>> struct dma_page *d_page;
>>>>
>>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>>>> if (!d_page)
>>>> return NULL;
>>>>
>>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>>>> &d_page->dma,
>>>> pool->gfp_flags);
>>>> if (d_page->vaddr) {
>>>> if (is_vmalloc_addr(d_page->vaddr))
>>>> d_page->p = vmalloc_to_page(d_page->vaddr);
>>>> else
>>>> d_page->p = virt_to_page(d_page->vaddr);
>>> There are cases on various architectures where neither of those is
>>> right. Whether those actually intersect with TTM or RDMA use-cases is
>>> another matter, of course.
>>>
>>> What definitely is a problem is if you ever take that page and end up
>>> accessing it through any virtual address other than the one explicitly
>>> returned by dma_alloc_coherent(). That can blow the coherency wide open
>>> and invite data loss, right up to killing the whole system with a
>>> machine check on certain architectures.
>>>
>>> Robin.
>> Hi, Robin
>> Thanks for your comment.
>>
>> We have one problem, and the related code is below.
>> 1. Call the dma_alloc_coherent function several times to allocate memory.
>> 2. vmap() the allocated memory pages.
>> 3. Software accesses the memory through the virtual address returned
>> by vmap(), while hardware uses the dma addresses returned by
>> dma_alloc_coherent().
> The simple answer is "don't do that". Seriously. dma_alloc_coherent()
> gives you a CPU virtual address and a DMA address with which to access
> your buffer, and that is the limit of what you may infer about it. You
> have no guarantee that the virtual address is either in the linear map
> or vmalloc, and not some other special place. You have no guarantee that
> the underlying memory even has an associated struct page at all.
>
>> When the IOMMU is disabled on the ARM64 architecture, we use
>> virt_to_page() before vmap() and it works. When the IOMMU is
>> enabled, using virt_to_page() causes a call trace later; we found
>> that the address returned by dma_alloc_coherent is a vmalloc
>> address, so we added the conditional expression below, and it
>> works.
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] =
>> is_vmalloc_addr(buf->page_list[i].buf) ?
>> vmalloc_to_page(buf->page_list[i].buf) :
>> virt_to_page(buf->page_list[i].buf);
>> Can you give us a suggestion? Is there a better method?
> Oh my goodness, having now taken a closer look at this driver, I'm lost
> for words in disbelief. To pick just one example:
>
> u32 bits_per_long = BITS_PER_LONG;
> ...
> if (bits_per_long == 64) {
> /* memory mapping nonsense */
> }
>
> WTF does the size of a long have to do with DMA buffer management!?
>
> Of course I can guess that it might be trying to make some tortuous
> inference about vmalloc space being constrained on 32-bit platforms, but
> still...
We will fix it. Thanks
>> The related code as below:
>> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>> GFP_KERNEL);
>> if (!buf->page_list)
>> return -ENOMEM;
>>
>> for (i = 0; i < buf->nbufs; ++i) {
>> buf->page_list[i].buf = dma_alloc_coherent(dev,
>> page_size, &t,
>> GFP_KERNEL);
>> if (!buf->page_list[i].buf)
>> goto err_free;
>>
>> buf->page_list[i].map = t;
>> memset(buf->page_list[i].buf, 0, page_size);
>> }
>>
>> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
>> GFP_KERNEL);
>> if (!pages)
>> goto err_free;
>>
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] =
>> is_vmalloc_addr(buf->page_list[i].buf) ?
>> vmalloc_to_page(buf->page_list[i].buf) :
>> virt_to_page(buf->page_list[i].buf);
>>
>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>> PAGE_KERNEL);
>> kfree(pages);
>> if (!buf->direct.buf)
>> goto err_free;
> OK, this is complete crap. As above, you cannot assume that a struct
> page even exists; even if it does you cannot assume that using a
> PAGE_KERNEL mapping will not result in mismatched attributes,
> unpredictable behaviour and data loss. Trying to remap coherent DMA
> allocations like this is just egregiously wrong.
>
> What I do like is that you can seemingly fix all this by simply deleting
> hns_roce_buf::direct and all the garbage code related to it, and using
> the page_list entries consistently because the alternate paths involving
> those appear to do the right thing already.
Hi, Robin
Thanks for your suggestion.
We will fix it.
Regards
Wei Hu
> That is, of course, assuming that the buffers involved can be so large
> that it's not practical to just always make a single allocation and
> fragment it into multiple descriptors if the hardware does have some
> maximum length constraint - frankly I'm a little puzzled by the
> PAGE_SIZE * 2 threshold, given that that's not a fixed size.
>
> Robin.
>
>> Regards
>> Wei Hu
>>>> } else {
>>>> kfree(d_page);
>>>> d_page = NULL;
>>>> }
>>>> return d_page;
>>>> }
>>>>
>>>> Regards
>>>> Wei Hu
>>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>>>>> +++++++++++++++++++++++++++---
>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> index 3e4c525..a69cd4b 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>>>>> *hr_dev, u32 size, u32 max_direct,
>>>>>> goto err_free;
>>>>>>
>>>>>> for (i = 0; i < buf->nbufs; ++i)
>>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>>>>> + pages[i] =
>>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>>>>> + virt_to_page(buf->page_list[i].buf);
>>>>>>
>>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>>>> PAGE_KERNEL);
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> index 8388ae2..4a3d1d4 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> gfp_t gfp_mask)
>>>>>> {
>>>>>> struct hns_roce_hem_chunk *chunk = NULL;
>>>>>> + struct hns_roce_vmalloc *vmalloc;
>>>>>> struct hns_roce_hem *hem;
>>>>>> struct scatterlist *mem;
>>>>>> int order;
>>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>>>> chunk->npages = 0;
>>>>>> chunk->nsg = 0;
>>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>>>>> }
>>>>>>
>>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> if (!buf)
>>>>>> goto fail;
>>>>>>
>>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> + if (is_vmalloc_addr(buf)) {
>>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>>>>> + vmalloc->is_vmalloc_addr = true;
>>>>>> + vmalloc->vmalloc_addr = buf;
>>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>>>>> + PAGE_SIZE << order, offset_in_page(buf));
>>>>>> + } else {
>>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> + }
>>>>>> WARN_ON(mem->offset);
>>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>>>>
>>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>>>>> hns_roce_hem *hem)
>>>>>> {
>>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>>>>> + void *cpu_addr;
>>>>>> int i;
>>>>>>
>>>>>> if (!hem)
>>>>>> return;
>>>>>>
>>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>>>> - for (i = 0; i < chunk->npages; ++i)
>>>>>> + for (i = 0; i < chunk->npages; ++i) {
>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>>>> + else
>>>>>> + cpu_addr =
>>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>>>>> +
>>>>>> dma_free_coherent(hr_dev->dev,
>>>>>> chunk->mem[i].length,
>>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>>>> + cpu_addr,
>>>>>> sg_dma_address(&chunk->mem[i]));
>>>>>> + }
>>>>>> kfree(chunk);
>>>>>> }
>>>>>>
>>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>>>> *hr_dev,
>>>>>>
>>>>>> if (chunk->mem[i].length > (u32)offset) {
>>>>>> page = sg_page(&chunk->mem[i]);
>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>>>> + mutex_unlock(&table->mutex);
>>>>>> + return page ?
>>>>>> + chunk->vmalloc[i].vmalloc_addr
>>>>>> + + offset : NULL;
>>>>>> + }
>>>>>> goto out;
>>>>>> }
>>>>>> offset -= chunk->mem[i].length;
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> index af28bbf..62d712a 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> @@ -72,11 +72,17 @@ enum {
>>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>>>> };
>>>>>>
>>>>>> +struct hns_roce_vmalloc {
>>>>>> + bool is_vmalloc_addr;
>>>>>> + void *vmalloc_addr;
>>>>>> +};
>>>>>> +
>>>>>> struct hns_roce_hem_chunk {
>>>>>> struct list_head list;
>>>>>> int npages;
>>>>>> int nsg;
>>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>> };
>>>>>>
>>>>>> struct hns_roce_hem {
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> index b99d70a..9e19bf1 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>> {
>>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>>>> struct scatterlist *sg;
>>>>>> + u64 page_addr = 0;
>>>>>> u64 *pages;
>>>>>> + int i = 0, j = 0;
>>>>>> + int len = 0;
>>>>>> int entry;
>>>>>> - int i;
>>>>>>
>>>>>> mpt_entry = mb_buf;
>>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>>
>>>>>> i = 0;
>>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>>>> -
>>>>>> - /* Record the first 2 entry directly to MTPT table */
>>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> - break;
>>>>>> - i++;
>>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>>>> + for (j = 0; j < len; ++j) {
>>>>>> + page_addr = sg_dma_address(sg) +
>>>>>> + (j << mr->umem->page_shift);
>>>>>> + pages[i] = page_addr >> 6;
>>>>>> +
>>>>>> + /* Record the first 2 entry directly to MTPT table */
>>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> + goto found;
>>>>>> + i++;
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> +found:
>>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>>> _______________________________________________
>>>> iommu mailing list
>>>> iommu@lists.linux-foundation.org
>>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>> .
>>>
>>
end of thread, other threads:[~2017-11-09 1:36 UTC | newest]
Thread overview: 57+ messages
2017-09-30 9:28 [PATCH for-next 0/4] Add Features & Code improvements for hip08 Wei Hu (Xavier)
2017-09-30 9:28 ` [PATCH for-next 1/4] RDMA/hns: Support WQE/CQE/PBL page size configurable feature in hip08 Wei Hu (Xavier)
2017-09-30 9:29 ` [PATCH for-next 3/4] RDMA/hns: Update the IRRL table chunk size in hip08 Wei Hu (Xavier)
2017-10-01 5:40 ` Leon Romanovsky
2017-10-17 11:40 ` Wei Hu (Xavier)
2017-09-30 9:28 ` [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08 Wei Hu (Xavier)
2017-09-30 16:10 ` Leon Romanovsky
2017-10-12 12:31 ` Wei Hu (Xavier)
2017-10-12 12:59 ` Robin Murphy
2017-11-01 7:46 ` Wei Hu (Xavier)
2017-11-01 12:26 ` Robin Murphy
2017-11-07 2:45 ` Wei Hu (Xavier)
2017-11-07 6:32 ` Leon Romanovsky
2017-11-09 1:17 ` Wei Hu (Xavier)
2017-11-07 15:48 ` Jason Gunthorpe
2017-11-07 15:58 ` Christoph Hellwig
2017-11-07 16:03 ` Jason Gunthorpe
2017-11-09 1:26 ` Wei Hu (Xavier)
2017-11-09 1:30 ` Wei Hu (Xavier)
2017-11-09 1:36 ` Wei Hu (Xavier)
2017-10-12 14:54 ` Leon Romanovsky
2017-10-18 8:42 ` Wei Hu (Xavier)
2017-10-18 9:12 ` Wei Hu (Xavier)
2017-10-18 14:23 ` Leon Romanovsky
2017-09-30 9:29 ` [PATCH for-next 4/4] RDMA/hns: Update the PD&CQE&MTT specification in hip08 Wei Hu (Xavier)