* [RFC PATCH 0/2] IB/uverbs: Add support for registering mmapped memory
@ 2010-12-02 19:02 Tom Tucker
       [not found] ` <20101202190157.13657.58176.stgit-T4OLL4TyM9aNDNWfRnPdfg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Tom Tucker @ 2010-12-02 19:02 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: tom-/Yg/VP3ZvrM, garlick-i2BcT+NCU+M,
	acook-osxm6dDZNBBZx8iatJs59jGjJy/sRE9J

This patch series adds the ability for a user-mode program to register
mmapped memory.  The capability was developed to support sharing device
memory, for example PCI-E static/flash RAM devices, over the network with
RDMA.  It is also useful for sharing kernel-resident data with distributed
system-monitoring applications (e.g. vmstats) at zero overhead to the
monitored host.
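
As an illustration, a minimal userspace sketch of the intended usage (not
part of the series).  The device node name, region length handling, and
protection domain are hypothetical, and error handling is trimmed:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Map a chunk of device memory and register it for RDMA.  The device
 * path "/dev/example_nvram" is a placeholder. */
static struct ibv_mr *reg_device_memory(struct ibv_pd *pd, size_t len)
{
	int fd = open("/dev/example_nvram", O_RDWR);
	void *buf;

	if (fd < 0)
		return NULL;

	/* The exporting driver's mmap() handler typically remaps BAR or
	 * flash space with remap_pfn_range(), so this vma is VM_PFNMAP
	 * rather than backed by struct pages. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return NULL;

	/* With this series, ib_umem_get() accepts the VM_PFNMAP mapping,
	 * so the registration no longer fails for device memory. */
	return ibv_reg_mr(pd, buf, len,
			  IBV_ACCESS_LOCAL_WRITE |
			  IBV_ACCESS_REMOTE_READ |
			  IBV_ACCESS_REMOTE_WRITE);
}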

---

Tom Tucker (2):
      IB/uverbs: Add support for user registration of mmap memory
      IB/uverbs: Add memory type to ib_umem structure


 drivers/infiniband/core/umem.c |  272 +++++++++++++++++++++++++++++++++++++---
 include/rdma/ib_umem.h         |    6 +
 2 files changed, 259 insertions(+), 19 deletions(-)

-- 
Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC PATCH 1/2] IB/uverbs: Add memory type to ib_umem structure
       [not found] ` <20101202190157.13657.58176.stgit-T4OLL4TyM9aNDNWfRnPdfg@public.gmane.org>
@ 2010-12-02 19:02   ` Tom Tucker
  2010-12-02 19:02   ` [RFC PATCH 2/2] IB/uverbs: Add support for user registration of mmap memory Tom Tucker
  1 sibling, 0 replies; 6+ messages in thread
From: Tom Tucker @ 2010-12-02 19:02 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: tom-/Yg/VP3ZvrM, garlick-i2BcT+NCU+M,
	acook-osxm6dDZNBBZx8iatJs59jGjJy/sRE9J

Add a memory type field to the ib_umem structure to keep track
of the type of memory registered by the user.

Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
---

 include/rdma/ib_umem.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 9ee0d2e..566b07f 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -39,8 +39,14 @@
 
 struct ib_ucontext;
 
+enum ib_umem_type {
+	IB_UMEM_MEM_MAP = 0,
+	IB_UMEM_IO_MAP
+};
+
 struct ib_umem {
 	struct ib_ucontext     *context;
+	enum ib_umem_type	type;
 	size_t			length;
 	int			offset;
 	int			page_size;


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [RFC PATCH 2/2] IB/uverbs: Add support for user registration of mmap memory
       [not found] ` <20101202190157.13657.58176.stgit-T4OLL4TyM9aNDNWfRnPdfg@public.gmane.org>
  2010-12-02 19:02   ` [RFC PATCH 1/2] IB/uverbs: Add memory type to ib_umem structure Tom Tucker
@ 2010-12-02 19:02   ` Tom Tucker
       [not found]     ` <20101202190229.13657.85728.stgit-T4OLL4TyM9aNDNWfRnPdfg@public.gmane.org>
  1 sibling, 1 reply; 6+ messages in thread
From: Tom Tucker @ 2010-12-02 19:02 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: tom-/Yg/VP3ZvrM, garlick-i2BcT+NCU+M,
	acook-osxm6dDZNBBZx8iatJs59jGjJy/sRE9J

Add support to the ib_umem_get() helper function for handling
mmapped memory.

Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
---

 drivers/infiniband/core/umem.c |  272 +++++++++++++++++++++++++++++++++++++---
 1 files changed, 253 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 415e186..357ca5e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -52,30 +52,24 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	int i;
 
 	list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) {
-		ib_dma_unmap_sg(dev, chunk->page_list,
-				chunk->nents, DMA_BIDIRECTIONAL);
-		for (i = 0; i < chunk->nents; ++i) {
-			struct page *page = sg_page(&chunk->page_list[i]);
-
-			if (umem->writable && dirty)
-				set_page_dirty_lock(page);
-			put_page(page);
-		}
+		if (umem->type == IB_UMEM_MEM_MAP) {
+			ib_dma_unmap_sg(dev, chunk->page_list,
+					chunk->nents, DMA_BIDIRECTIONAL);
+			for (i = 0; i < chunk->nents; ++i) {
+				struct page *page = sg_page(&chunk->page_list[i]);
 
+				if (umem->writable && dirty)
+					set_page_dirty_lock(page);
+				put_page(page);
+			}
+		}
 		kfree(chunk);
 	}
 }
 
-/**
- * ib_umem_get - Pin and DMA map userspace memory.
- * @context: userspace context to pin memory for
- * @addr: userspace virtual address to start at
- * @size: length of region to pin
- * @access: IB_ACCESS_xxx flags for memory being pinned
- * @dmasync: flush in-flight DMA when the memory region is written
- */
-struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
-			    size_t size, int access, int dmasync)
+static struct ib_umem *__umem_get(struct ib_ucontext *context,
+				  unsigned long addr, size_t size,
+				  int access, int dmasync)
 {
 	struct ib_umem *umem;
 	struct page **page_list;
@@ -100,6 +94,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	if (!umem)
 		return ERR_PTR(-ENOMEM);
 
+	umem->type	= IB_UMEM_MEM_MAP;
 	umem->context   = context;
 	umem->length    = size;
 	umem->offset    = addr & ~PAGE_MASK;
@@ -215,6 +210,245 @@ out:
 
 	return ret < 0 ? ERR_PTR(ret) : umem;
 }
+
+/*
+ * Return the PFN for the specified address in the vma. This only
+ * works for a vma that is VM_PFNMAP.
+ */
+static unsigned long __follow_io_pfn(struct vm_area_struct *vma,
+				     unsigned long address, int write)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep, pte;
+	spinlock_t *ptl;
+	unsigned long pfn;
+	struct mm_struct *mm = vma->vm_mm;
+
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		return 0;
+
+	pud = pud_offset(pgd, address);
+	if (pud_none(*pud))
+		return 0;
+	if (unlikely(pud_bad(*pud)))
+		return 0;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		return 0;
+	if (unlikely(pmd_bad(*pmd)))
+		return 0;
+
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	pte = *ptep;
+	if (!pte_present(pte))
+		goto bad;
+	if (write && !pte_write(pte))
+		goto bad;
+
+	pfn = pte_pfn(pte);
+	pte_unmap_unlock(ptep, ptl);
+	return pfn;
+ bad:
+	pte_unmap_unlock(ptep, ptl);
+	return 0;
+}
+
+static int __get_io_pfn(struct task_struct *tsk, struct mm_struct *mm,
+			unsigned long start, int len, int write, int force,
+			unsigned long *pfn_list, struct vm_area_struct **vmas)
+{
+	unsigned long pfn;
+	int i;
+	if (len <= 0)
+		return 0;
+
+	i = 0;
+	do {
+		struct vm_area_struct *vma;
+
+		vma = find_vma(mm, start);
+		if (!(vma->vm_flags & VM_PFNMAP))
+			return -EINVAL;
+
+		if (!(vma->vm_flags & VM_IO))
+			return -EFAULT;
+
+		if (is_vm_hugetlb_page(vma))
+			return -EFAULT;
+
+		do {
+			cond_resched();
+			pfn = __follow_io_pfn(vma, start, write);
+			if (!pfn)
+				return -EFAULT;
+			if (pfn_list)
+				pfn_list[i] = pfn;
+			if (vmas)
+				vmas[i] = vma;
+			i++;
+			start += PAGE_SIZE;
+			len--;
+		} while (len && start < vma->vm_end);
+	} while (len);
+	return i;
+}
+
+static struct ib_umem *__iomem_get(struct ib_ucontext *context,
+				   unsigned long addr, size_t size,
+				   int access, int dmasync)
+{
+	struct ib_umem *umem;
+	unsigned long *pfn_list;
+	struct ib_umem_chunk *chunk;
+	unsigned long locked;
+	unsigned long lock_limit;
+	unsigned long cur_base;
+	unsigned long npages;
+	int ret;
+	int off;
+	int i;
+	DEFINE_DMA_ATTRS(attrs);
+
+	if (dmasync)
+		dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
+
+	if (!can_do_mlock())
+		return ERR_PTR(-EPERM);
+
+	umem = kmalloc(sizeof *umem, GFP_KERNEL);
+	if (!umem)
+		return ERR_PTR(-ENOMEM);
+
+	umem->type	= IB_UMEM_IO_MAP;
+	umem->context   = context;
+	umem->length    = size;
+	umem->offset    = addr & ~PAGE_MASK;
+	umem->page_size = PAGE_SIZE;
+
+	/*
+	 * We ask for writable memory if any access flags other than
+	 * "remote read" are set.  "Local write" and "remote write"
+	 * obviously require write access.  "Remote atomic" can do
+	 * things like fetch and add, which will modify memory, and
+	 * "MW bind" can change permissions by binding a window.
+	 */
+	umem->writable  = !!(access & ~IB_ACCESS_REMOTE_READ);
+
+	/* IO memory is not hugetlb memory */
+	umem->hugetlb   = 0;
+
+	INIT_LIST_HEAD(&umem->chunk_list);
+
+	pfn_list = (unsigned long *) __get_free_page(GFP_KERNEL);
+	if (!pfn_list) {
+		kfree(umem);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT;
+
+	down_write(&current->mm->mmap_sem);
+
+	locked     = npages + current->mm->locked_vm;
+	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
+
+	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	cur_base = addr & PAGE_MASK;
+
+	ret = 0;
+	while (npages) {
+		ret = __get_io_pfn(current, current->mm, cur_base,
+				   min_t(unsigned long, npages,
+					 PAGE_SIZE / sizeof(unsigned long *)),
+				   umem->writable,
+				   !umem->writable, pfn_list, NULL);
+
+		if (ret < 0)
+			goto out;
+
+		cur_base += ret * PAGE_SIZE;
+		npages   -= ret;
+
+		off = 0;
+
+		while (ret) {
+			chunk = kmalloc(sizeof *chunk +
+					sizeof(struct scatterlist) *
+					min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK),
+					GFP_KERNEL);
+			if (!chunk) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK);
+			sg_init_table(chunk->page_list, chunk->nents);
+			/* The pfn_list we built is a set of Page
+			 * Frame Numbers (PFN) whose physical address
+			 * is PFN << PAGE_SHIFT. The SG DMA mapping
+			 * services expect page addresses, not PFN,
+			 * therefore, we have to do the dma mapping
+			 * ourselves here. */
+			for (i = 0; i < chunk->nents; ++i) {
+				sg_set_page(&chunk->page_list[i], 0,
+					    PAGE_SIZE, 0);
+				chunk->page_list[i].dma_address =
+					(pfn_list[i + off] << PAGE_SHIFT);
+				chunk->page_list[i].dma_length = PAGE_SIZE;
+			}
+			chunk->nmap = chunk->nents;
+			ret -= chunk->nents;
+			off += chunk->nents;
+			list_add_tail(&chunk->list, &umem->chunk_list);
+		}
+	}
+
+out:
+	if (ret < 0) {
+		__ib_umem_release(context->device, umem, 0);
+		kfree(umem);
+	} else
+		current->mm->locked_vm = locked;
+	up_write(&current->mm->mmap_sem);
+	free_page((unsigned long) pfn_list);
+	return ret < 0 ? ERR_PTR(ret) : umem;
+}
+
+/**
+ * ib_umem_get - Pin and DMA map userspace memory.
+ * @context: userspace context to pin memory for
+ * @addr: userspace virtual address to start at
+ * @size: length of region to pin
+ * @access: IB_ACCESS_xxx flags for memory being pinned
+ * @dmasync: flush in-flight DMA when the memory region is written
+ */
+struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
+			    size_t size, int access, int dmasync)
+{
+	/* Do a quick lookup of the containing VMA to determine its type. */
+	struct vm_area_struct *vma;
+	int do_pfn_map = 0;
+
+	down_write(&current->mm->mmap_sem);
+	vma = find_vma(current->mm, addr & PAGE_MASK);
+	if (vma)
+		do_pfn_map = vma->vm_flags & VM_PFNMAP;
+	up_write(&current->mm->mmap_sem);
+	if (!vma)
+		return ERR_PTR(-EINVAL);
+
+	if (do_pfn_map)
+		return __iomem_get(context, addr, size, access, dmasync);
+
+	return __umem_get(context, addr, size, access, dmasync);
+}
 EXPORT_SYMBOL(ib_umem_get);
 
 static void ib_umem_account(struct work_struct *work)

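For reference, a sketch (not part of this patch) of how a consumer such as
an HCA driver might walk a region registered this way.  With the change
above the walk is the same for both umem types, since __iomem_get() fills
in the sg dma_address/dma_length fields directly:

#include <linux/kernel.h>
#include <rdma/ib_verbs.h>
#include <rdma/ib_umem.h>

static void show_umem_pages(struct ib_device *dev, struct ib_umem *umem)
{
	struct ib_umem_chunk *chunk;
	int i;

	list_for_each_entry(chunk, &umem->chunk_list, list)
		for (i = 0; i < chunk->nmap; i++)
			pr_info("umem %s page at dma %#llx\n",
				umem->type == IB_UMEM_IO_MAP ? "io" : "mem",
				(unsigned long long)
				ib_sg_dma_address(dev, &chunk->page_list[i]));
}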

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 2/2] IB/uverbs: Add support for user registration of mmap memory
       [not found]     ` <20101202190229.13657.85728.stgit-T4OLL4TyM9aNDNWfRnPdfg@public.gmane.org>
@ 2010-12-02 19:35       ` Ralph Campbell
       [not found]         ` <1291318531.24256.272.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Ralph Campbell @ 2010-12-02 19:35 UTC (permalink / raw)
  To: Tom Tucker
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, garlick-i2BcT+NCU+M,
	acook-osxm6dDZNBBZx8iatJs59jGjJy/sRE9J

I understand the need for something like this patch since
the GPU folks would also like to mmap memory (although the
memory is marked as vma->vm_flags & VM_IO).

It seems to me that duplicating the page walking code is
the wrong approach and exporting a new interface from
mm/memory.c is more appropriate.
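
(One possible shape for that, sketched below with the follow_pfn() helper
that mm/memory.c already exports.  Note that, unlike __follow_io_pfn(),
follow_pfn() does not perform the pte_write() check, so this is an
illustration rather than a drop-in replacement:)

#include <linux/mm.h>

static int get_io_pfns(struct vm_area_struct *vma, unsigned long start,
		       int npages, unsigned long *pfn_list)
{
	int i, ret;

	for (i = 0; i < npages; i++) {
		/* fails with -EINVAL unless the vma is VM_IO | VM_PFNMAP */
		ret = follow_pfn(vma, start + i * PAGE_SIZE, &pfn_list[i]);
		if (ret)
			return ret;
	}
	return npages;
}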

Also, the quick check to find_vma() is essentially duplicated
if get_user_pages() is called and it doesn't handle the case
when the region spans multiple vma regions with different flags.
Maybe we can modify get_user_pages to have a new flag which
allows VM_PFNMAP segments to be accessed as IB memory regions.
The problem is that VM_PFNMAP means there is no corresponding
struct page to handle reference counting. What happens if the
device that exports the VM_PFNMAP memory is hot removed?
Can the device reference count be incremented to prevent that?



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 2/2] IB/uverbs: Add support for user registration of mmap memory
       [not found]         ` <1291318531.24256.272.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
@ 2010-12-02 20:24           ` Tom Tucker
       [not found]             ` <4CF80083.2040907-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Tom Tucker @ 2010-12-02 20:24 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Tom Tucker, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	garlick-i2BcT+NCU+M, acook-osxm6dDZNBBZx8iatJs59jGjJy/sRE9J

Personally I think the biggest issue is that I don't think the pfn to dma 
address mapping logic is portable.

On 12/2/10 1:35 PM, Ralph Campbell wrote:
> I understand the need for something like this patch since
> the GPU folks would also like to mmap memory (although the
> memory is marked as vma->vm_flags & VM_IO).
>
> It seems to me that duplicating the page walking code is
> the wrong approach and exporting a new interface from
> mm/memory.c is more appropriate.
>
Perhaps, but that's kernel proper (not a module) and has its own issues.
For example, it represents an exported kernel interface and therefore a
kernel compatibility commitment going forward. I suggest that a new kernel
interface is a separate effort that this code could utilize going forward.

> Also, the quick check to find_vma() is essentially duplicated
> if get_user_pages() is called

You need to know the type before you know how to handle it. Unless you
want to tear up get_user_pages, I think this non-performance-path double
lookup is a non-issue.

> and it doesn't handle the case
> when the region spans multiple vma regions with different flags.

Actually, it specifically does not allow that and I'm not sure that is 
something you would want to support.

> Maybe we can modify get_user_pages to have a new flag which
> allows VM_PFNMAP segments to be accessed as IB memory regions.
> The problem is that VM_PFNMAP means there is no corresponding
> struct page to handle reference counting. What happens if the
> device that exports the VM_PFNMAP memory is hot removed?

Bus Address Error.

> Can the device reference count be incremented to prevent that?
>

I don't think that would go in this code; it would go in the driver that
gave the user the address in the first place.
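
(A sketch of what that could look like in the exporting driver, pinning
the device object for as long as any mapping of it exists; the example_dev
structure and all of the naming here are hypothetical:)

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/mm.h>
#include <linux/slab.h>

struct example_dev {
	struct kref	ref;
	/* device state, BAR/flash physical address, etc. */
};

static void example_dev_release(struct kref *ref)
{
	kfree(container_of(ref, struct example_dev, ref));
}

/* Take a reference for each mapping of the device memory. */
static void example_vma_open(struct vm_area_struct *vma)
{
	struct example_dev *ed = vma->vm_private_data;

	kref_get(&ed->ref);
}

static void example_vma_close(struct vm_area_struct *vma)
{
	struct example_dev *ed = vma->vm_private_data;

	kref_put(&ed->ref, example_dev_release);
}

static const struct vm_operations_struct example_vm_ops = {
	.open	= example_vma_open,
	.close	= example_vma_close,
};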


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 2/2] IB/uverbs: Add support for user registration of mmap memory
       [not found]             ` <4CF80083.2040907-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2010-12-02 21:34               ` Ralph Campbell
  0 siblings, 0 replies; 6+ messages in thread
From: Ralph Campbell @ 2010-12-02 21:34 UTC (permalink / raw)
  To: Tom Tucker
  Cc: Tom Tucker, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	garlick-i2BcT+NCU+M, acook-osxm6dDZNBBZx8iatJs59jGjJy/sRE9J

On Thu, 2010-12-02 at 12:24 -0800, Tom Tucker wrote:
> Personally I think the biggest issue is that I don't think the pfn to dma 
> address mapping logic is portable.

Perhaps. That is why the core Linux VM folks should be
involved.

> On 12/2/10 1:35 PM, Ralph Campbell wrote:
> > I understand the need for something like this patch since
> > the GPU folks would also like to mmap memory (although the
> > memory is marked as vma->vm_flags & VM_IO).
> >
> > It seems to me that duplicating the page walking code is
> > the wrong approach and exporting a new interface from
> > mm/memory.c is more appropriate.
> >
> Perhaps, but that's kernel proper (not a module) and has its own issues.
> For example, it represents an exported kernel interface and therefore a
> kernel compatibility commitment going forward. I suggest that a new kernel
> interface is a separate effort that this code could utilize going forward.

I agree. That was what I was implying.

> > Also, the quick check to find_vma() is essentially duplicated
> > if get_user_pages() is called
> 
> You need to know the type before you know how to handle it. Unless you
> want to tear up get_user_pages, I think this non-performance-path double
> lookup is a non-issue.

You only need the type if the translation is handled as the
patch proposes. get_user_pages() could handle getting
the physical addresses differently as it checks each vma region.

> > and it doesn't handle the case
> > when the region spans multiple vma regions with different flags.
> 
> Actually, it specifically does not allow that and I'm not sure that is 
> something you would want to support.

True. It doesn't support it because without reference counting,
there would need to be a callback mechanism to let the caller know
if/when the mapping changes or is invalidated.

> > Maybe we can modify get_user_pages to have a new flag which
> > allows VM_PFNMAP segments to be accessed as IB memory regions.
> > The problem is that VM_PFNMAP means there is no corresponding
> > struct page to handle reference counting. What happens if the
> > device that exports the VM_PFNMAP memory is hot removed?
> 
> Bus Address Error.

> > Can the device reference count be incremented to prevent that?
> >
> 
> I don't think that would go in this code; it would go in the driver that
> gave the user the address in the first place.

Perhaps. Once the IB core has the physical address, the user may
unmap the range, which would notify the exporting driver via munmap, but
IB wouldn't know that had happened without some sort of callback
notification. The IB hardware could happily go on DMA'ing to that
physical address.
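
(One possible shape for such a notification, using the existing mmu
notifier hooks (CONFIG_MMU_NOTIFIER); the watch structure and the
driver-side fencing it would trigger are hypothetical:)

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct ib_iomem_watch {
	struct mmu_notifier	mn;
	unsigned long		start;
	unsigned long		end;
	void (*invalidate)(struct ib_iomem_watch *w);	/* e.g. fence the MR */
};

static void iomem_watch_invalidate_range_start(struct mmu_notifier *mn,
					       struct mm_struct *mm,
					       unsigned long start,
					       unsigned long end)
{
	struct ib_iomem_watch *w = container_of(mn, struct ib_iomem_watch, mn);

	/* called before the pfns go away, e.g. on munmap of the range */
	if (start < w->end && end > w->start)
		w->invalidate(w);
}

static const struct mmu_notifier_ops iomem_watch_ops = {
	.invalidate_range_start	= iomem_watch_invalidate_range_start,
};

static int iomem_watch_register(struct ib_iomem_watch *w)
{
	w->mn.ops = &iomem_watch_ops;
	return mmu_notifier_register(&w->mn, current->mm);
}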


^ permalink raw reply	[flat|nested] 6+ messages in thread
