All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@ziepe.ca>
To: Michal Kalderon <michal.kalderon@marvell.com>
Cc: mkalderon@marvell.com, aelior@marvell.com, dledford@redhat.com,
	bmt@zurich.ibm.com, galpress@amazon.com, sleybo@amazon.com,
	leon@kernel.org, linux-rdma@vger.kernel.org,
	Ariel Elior <ariel.elior@marvell.com>
Subject: Re: [PATCH v7 rdma-next 2/7] RDMA/core: Create mmap database and cookie helper functions
Date: Tue, 20 Aug 2019 10:21:25 -0300	[thread overview]
Message-ID: <20190820132125.GC29246@ziepe.ca> (raw)
In-Reply-To: <20190820121847.25871-3-michal.kalderon@marvell.com>

On Tue, Aug 20, 2019 at 03:18:42PM +0300, Michal Kalderon wrote:
> Create some common API's for adding entries to a xa_mmap.
> Searching for an entry and freeing one.
> 
> Most of the code was copied from the efa driver almost as is, just renamed
> function to be generic and not efa specific.
> In addition to original code, the xa_mmap entries are now linked
> to a umap_priv object and reference counted according to umap operations.
> The fact that this code moved to core enabled managing it differently,
> so that now entries can be removed and deleted when driver+user are
> done with them. This enabled changing the insert algorithm in
> comparison to what was done in efa.
> 
> Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
>  drivers/infiniband/core/core_priv.h      |  12 +-
>  drivers/infiniband/core/device.c         |   1 +
>  drivers/infiniband/core/ib_core_uverbs.c | 343 +++++++++++++++++++++++++++++--
>  drivers/infiniband/core/rdma_core.c      |   1 +
>  drivers/infiniband/core/uverbs_cmd.c     |   1 +
>  drivers/infiniband/core/uverbs_main.c    |  18 +-
>  include/rdma/ib_verbs.h                  |  41 +++-
>  7 files changed, 381 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
> index 6850e973401c..4951ecfbf133 100644
> +++ b/drivers/infiniband/core/core_priv.h
> @@ -388,9 +388,17 @@ void rdma_nl_net_exit(struct rdma_dev_net *rnet);
>  struct rdma_umap_priv {
>  	struct vm_area_struct *vma;
>  	struct list_head list;
> +	struct rdma_user_mmap_entry *entry;
>  };
>  
> -void rdma_umap_priv_init(struct rdma_umap_priv *priv,
> -			 struct vm_area_struct *vma);
> +int rdma_umap_priv_init(struct vm_area_struct *vma,
> +			struct rdma_user_mmap_entry *entry);
> +
> +void rdma_umap_priv_delete(struct ib_uverbs_file *ufile,
> +			   struct rdma_umap_priv *priv);
> +
> +void rdma_user_mmap_entries_remove_free(struct ib_ucontext *ucontext);
> +void rdma_user_mmap_entry_put(struct ib_ucontext *ucontext,
> +			      struct rdma_user_mmap_entry *entry);
>  
>  #endif /* _CORE_PRIV_H */
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index 8892862fb759..229977237d1a 100644
> +++ b/drivers/infiniband/core/device.c
> @@ -2594,6 +2594,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
>  	SET_DEVICE_OP(dev_ops, map_mr_sg_pi);
>  	SET_DEVICE_OP(dev_ops, map_phys_fmr);
>  	SET_DEVICE_OP(dev_ops, mmap);
> +	SET_DEVICE_OP(dev_ops, mmap_free);
>  	SET_DEVICE_OP(dev_ops, modify_ah);
>  	SET_DEVICE_OP(dev_ops, modify_cq);
>  	SET_DEVICE_OP(dev_ops, modify_device);
> diff --git a/drivers/infiniband/core/ib_core_uverbs.c b/drivers/infiniband/core/ib_core_uverbs.c
> index cab7dc922cf0..cce20172cd71 100644
> +++ b/drivers/infiniband/core/ib_core_uverbs.c
> @@ -36,41 +36,98 @@
>  #include "uverbs.h"
>  #include "core_priv.h"
>  
> -/*
> - * Each time we map IO memory into user space this keeps track of the mapping.
> - * When the device is hot-unplugged we 'zap' the mmaps in user space to point
> - * to the zero page and allow the hot unplug to proceed.
> +/**
> + * rdma_umap_priv_init() - Initialize the private data of a vma
> + *
> + * @vma: The vm area struct that needs private data
> + * @entry: entry into the mmap_xa that needs to be linked with
> + *       this vma
> + *
> + * Each time we map IO memory into user space this keeps track
> + * of the mapping. When the device is hot-unplugged we 'zap' the
> + * mmaps in user space to point to the zero page and allow the
> + * hot unplug to proceed.
>   *
>   * This is necessary for cases like PCI physical hot unplug as the actual BAR
>   * memory may vanish after this and access to it from userspace could MCE.
>   *
>   * RDMA drivers supporting disassociation must have their user space designed
>   * to cope in some way with their IO pages going to the zero page.
> + *
> + * We extended the umap list usage to track all memory that was mapped by
> + * user space and not only the IO memory. This will occur for drivers that use
> + * the mmap_xa database and helper functions
> + *
> + * Return 0 on success or -ENOMEM if out of memory
>   */
> -void rdma_umap_priv_init(struct rdma_umap_priv *priv,
> -			 struct vm_area_struct *vma)
> +int rdma_umap_priv_init(struct vm_area_struct *vma,
> +			struct rdma_user_mmap_entry *entry)
>  {
>  	struct ib_uverbs_file *ufile = vma->vm_file->private_data;
> +	struct rdma_umap_priv *priv;
> +
> +	/* If the xa_mmap is used, private data will already be initialized.
> +	 * this is required for the cases that rdma_user_mmap_io is called
> +	 * from drivers that don't use the xa_mmap database yet
> +	 */
> +	if (vma->vm_private_data)
> +		return 0;

?? Still have to track the ufile->umaps though

> +/**
> + * rdma_user_mmap_entry_put() - drop reference to the mmap entry
> + *
> + * @ucontext: associated user context.
> + * @entry: An entry in the mmap_xa.
> + *
> + * This function is called when the mapping is closed or when
> + * the driver is done with the entry for some other reason.
> + * Should be called after rdma_user_mmap_entry_get was called
> + * and entry is no longer needed. This function will erase the
> + * entry and free it if it's refcnt reaches zero.
> + */
> +void rdma_user_mmap_entry_put(struct ib_ucontext *ucontext,
> +			      struct rdma_user_mmap_entry *entry)
> +{
> +	WARN_ON(!kref_read(&entry->ref));

kref_put does this internally when refcount debugging is enabled

> +	kref_put(&entry->ref, rdma_user_mmap_entry_free);
> +}
> +EXPORT_SYMBOL(rdma_user_mmap_entry_put);
> +
> +/**
> + * rdma_user_mmap_entry_remove() - Remove a key's entry from the mmap_xa
> + *
> + * @ucontext: associated user context.
> + * @key: The key to be deleted
> + *
> + * This function will find if there is an entry matching the key and if so
> + * decrease it's refcnt, which will in turn delete the entry if its refcount
> + * reaches zero.
> + */
> +void rdma_user_mmap_entry_remove(struct ib_ucontext *ucontext, u64 key)
> +{
> +	struct rdma_user_mmap_entry *entry;
> +	u32 mmap_page;
> +
> +	if (key == RDMA_USER_MMAP_INVALID)
> +		return;
> +
> +	mmap_page = key >> PAGE_SHIFT;
> +	if (mmap_page > U32_MAX)
> +		return;
> +
> +	entry = xa_load(&ucontext->mmap_xa, mmap_page);
> +	if (!entry)
> +		return;
> +
> +	rdma_user_mmap_entry_put(ucontext, entry);
> +}
> +EXPORT_SYMBOL(rdma_user_mmap_entry_remove);
> +
> +/**
> + * rdma_user_mmap_entry_insert() - Allocate and insert an entry to the mmap_xa.
> + *
> + * @ucontext: associated user context.
> + * @obj: opaque driver object that will be stored in the entry.
> + * @address: The address that will be mmapped to the user
> + * @length: Length of the address that will be mmapped
> + * @mmap_flag: opaque driver flags related to the address (For
> + *           example could be used for cachability)
> + *
> + * This function should be called by drivers that use the rdma_user_mmap
> + * interface for handling user mmapped addresses. The database is handled in
> + * the core and helper functions are provided to insert entries into the
> + * database and extract entries when the user call mmap with the given key.
> + * The function returns a unique key that should be provided to user, the user
> + * will use the key to map the given address.
> + *
> + * Return: unique key or RDMA_USER_MMAP_INVALID if entry was not added.
> + */
> +u64 rdma_user_mmap_entry_insert(struct ib_ucontext *ucontext, void *obj,
> +				u64 address, u64 length, u8 mmap_flag)
> +{
> +	XA_STATE(xas, &ucontext->mmap_xa, 0);
> +	struct rdma_user_mmap_entry *entry;
> +	unsigned long index = 0, index_max;
> +	u32 xa_first, xa_last, npages;
> +	int err, i;
> +	void *ent;
> +
> +	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
> +	if (!entry)
> +		return RDMA_USER_MMAP_INVALID;
> +
> +	entry->obj = obj;

It is more a kernel pattern to have the driver allocate a
rdma_user_mmap_entry and extend it with its 'priv', then use
container_of


> +	entry->address = address;
> +	entry->length = length;
> +	kref_init(&entry->ref);
> +	entry->mmap_flag = mmap_flag;
> +	entry->ucontext = ucontext;
> +
> +	xa_lock(&ucontext->mmap_xa);
> +
> +	/* We want to find an empty range */
> +	npages = (u32)DIV_ROUND_UP(length, PAGE_SIZE);
> +	do {
> +		/* First find an empty index */
> +		xas_find_marked(&xas, U32_MAX, XA_FREE_MARK);
> +		if (xas.xa_node == XAS_RESTART)
> +			goto err_unlock;
> +
> +		xa_first = xas.xa_index;
> +
> +		/* Is there enough room to have the range? */
> +		if (check_add_overflow(xa_first, npages, &xa_last))
> +			goto err_unlock;
> +
> +		/* Iterate over all present entries in the range. If a present
> +		 * entry exists we will finish this with the largest index
> +		 * occupied in the range which will serve as the start of the
> +		 * new search
> +		 */
> +		index_max = xa_last;
> +		xa_for_each_start(&ucontext->mmap_xa, index, ent, xa_first)

I think this can just be written as xas_next_entry() ?

And if it returns something we know the range xa_first -> xas.xa_index
is not occupied, then check if it has the right size? Otherwise the
range xa_first -> U32_MAX


> +			if (index < xa_last)
> +				index_max = index;
> +			else
> +				break;
> +		if (index_max == xa_last) /* range is free */
> +			break;
> +		/* o/w start again from largest index found in range */
> +		xas_set(&xas, index_max);
> +	} while (true);
> +
> +	for (i = xa_first; i < xa_last; i++) {
> +		err = __xa_insert(&ucontext->mmap_xa, i, entry, GFP_KERNEL);

Hum, keep in mind this is a bit tricky as the __xa_insert will drop
the xa_lock lock to allocate and a parallel thread could jump into the
gap

This seems undesirable, so we probably need to enclose the whole thing
in a sleeping mutex. Can probably use the umap_lock

> +void rdma_user_mmap_entries_remove_free(struct ib_ucontext *ucontext)
> +{
> +	struct rdma_user_mmap_entry *entry;
> +	unsigned long mmap_page;
> +
> +	WARN_ON(!xa_empty(&ucontext->mmap_xa));
> +	xa_for_each(&ucontext->mmap_xa, mmap_page, entry) {
> +		ibdev_dbg(ucontext->device,
> +			  "mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx] removed\n",
> +			  entry->obj, rdma_user_mmap_get_key(entry),
> +			  entry->address, entry->length);
> +
> +		/* override the refcnt to make sure entry is deleted */
> +		kref_init(&entry->ref);

Yikes, no. The zap flow has to clean this up so the kref goes
naturally to zero.

> +		rdma_user_mmap_entry_put(ucontext, entry);
> +	}
> +}
> +EXPORT_SYMBOL(rdma_user_mmap_entries_remove_free);
> diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
> index ccf4d069c25c..7166741834c8 100644
> +++ b/drivers/infiniband/core/rdma_core.c
> @@ -817,6 +817,7 @@ static void ufile_destroy_ucontext(struct ib_uverbs_file *ufile,
>  	rdma_restrack_del(&ucontext->res);
>  
>  	ib_dev->ops.dealloc_ucontext(ucontext);
> +	rdma_user_mmap_entries_remove_free(ucontext);
>  	kfree(ucontext);
>  
>  	ufile->ucontext = NULL;
> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> index 7ddd0e5bc6b3..4903e6eee854 100644
> +++ b/drivers/infiniband/core/uverbs_cmd.c
> @@ -254,6 +254,7 @@ static int ib_uverbs_get_context(struct uverbs_attr_bundle *attrs)
>  
>  	mutex_init(&ucontext->per_mm_list_lock);
>  	INIT_LIST_HEAD(&ucontext->per_mm_list);
> +	xa_init_flags(&ucontext->mmap_xa, XA_FLAGS_ALLOC);
>  
>  	ret = get_unused_fd_flags(O_CLOEXEC);
>  	if (ret < 0)
> diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
> index 180a5e0f70e4..80d0d3467d93 100644
> +++ b/drivers/infiniband/core/uverbs_main.c
> @@ -802,7 +802,7 @@ static void rdma_umap_open(struct vm_area_struct *vma)
>  {
>  	struct ib_uverbs_file *ufile = vma->vm_file->private_data;
>  	struct rdma_umap_priv *opriv = vma->vm_private_data;
> -	struct rdma_umap_priv *priv;
> +	int ret;
>  
>  	if (!opriv)
>  		return;
> @@ -816,10 +816,12 @@ static void rdma_umap_open(struct vm_area_struct *vma)
>  	if (!ufile->ucontext)
>  		goto out_unlock;
>  
> -	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> -	if (!priv)
> +	if (opriv->entry)
> +		kref_get(&opriv->entry->ref);
> +
> +	ret = rdma_umap_priv_init(vma, opriv->entry);
> +	if (ret)
>  		goto out_unlock;
> -	rdma_umap_priv_init(priv, vma);
>  
>  	up_read(&ufile->hw_destroy_rwsem);
>  	return;
> @@ -844,15 +846,15 @@ static void rdma_umap_close(struct vm_area_struct *vma)
>  	if (!priv)
>  		return;
>  
> +	if (priv->entry)
> +		rdma_user_mmap_entry_put(ufile->ucontext, priv->entry);
> +
>  	/*
>  	 * The vma holds a reference on the struct file that created it, which
>  	 * in turn means that the ib_uverbs_file is guaranteed to exist at
>  	 * this point.
>  	 */
> -	mutex_lock(&ufile->umap_lock);
> -	list_del(&priv->list);
> -	mutex_unlock(&ufile->umap_lock);
> -	kfree(priv);
> +	rdma_umap_priv_delete(ufile, priv);
>  }
>  
>  /*
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 391499008a22..b66c197a7079 100644
> +++ b/include/rdma/ib_verbs.h
> @@ -1479,6 +1479,7 @@ struct ib_ucontext {
>  	 * Implementation details of the RDMA core, don't use in drivers:
>  	 */
>  	struct rdma_restrack_entry res;
> +	struct xarray mmap_xa;
>  };
>  
>  struct ib_uobject {
> @@ -2259,6 +2260,19 @@ struct iw_cm_conn_param;
>  
>  #define DECLARE_RDMA_OBJ_SIZE(ib_struct) size_t size_##ib_struct
>  
> +#define RDMA_USER_MMAP_FLAG_SHIFT 56
> +#define RDMA_USER_MMAP_PAGE_MASK GENMASK(EFA_MMAP_FLAG_SHIFT - 1, 0)

Why is something called EFA_MMAP_FLAGS_SHIFT in ib_verbs.h?

Jason

  reply	other threads:[~2019-08-20 13:21 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-20 12:18 [PATCH v7 rdma-next 0/7] RDMA/qedr: Use the doorbell overflow recovery mechanism for RDMA Michal Kalderon
2019-08-20 12:18 ` [PATCH v7 rdma-next 1/7] RDMA/core: Move core content from ib_uverbs to ib_core Michal Kalderon
2019-08-20 12:58   ` Jason Gunthorpe
2019-08-20 21:30     ` Michal Kalderon
2019-08-21 16:30       ` Michal Kalderon
2019-08-20 14:08   ` Gal Pressman
2019-08-20 21:32     ` Michal Kalderon
2019-08-21  6:06       ` Gal Pressman
2019-08-21  7:56         ` Michal Kalderon
2019-08-20 12:18 ` [PATCH v7 rdma-next 2/7] RDMA/core: Create mmap database and cookie helper functions Michal Kalderon
2019-08-20 13:21   ` Jason Gunthorpe [this message]
2019-08-20 21:23     ` [EXT] " Michal Kalderon
2019-08-21 16:47       ` Michal Kalderon
2019-08-21 16:51         ` Jason Gunthorpe
2019-08-21 17:14           ` Michal Kalderon
2019-08-21 17:37             ` Jason Gunthorpe
2019-08-26 11:53               ` Michal Kalderon
2019-08-26 12:01                 ` Gal Pressman
2019-08-22  8:35   ` Gal Pressman
2019-08-25  8:36     ` Michal Kalderon
2019-08-25 10:39       ` Gal Pressman
2019-08-26  8:41         ` Michal Kalderon
2019-08-26 15:30       ` Michal Kalderon
2019-08-20 12:18 ` [PATCH v7 rdma-next 3/7] RDMA/efa: Use the common mmap_xa helpers Michal Kalderon
2019-08-22 13:18   ` Gal Pressman
2019-08-25  8:41     ` Michal Kalderon
2019-08-25 10:45       ` Gal Pressman
2019-08-26  8:42         ` Michal Kalderon
2019-08-20 12:18 ` [PATCH v7 rdma-next 4/7] RDMA/siw: " Michal Kalderon
2019-08-20 12:18 ` [PATCH v7 rdma-next 5/7] RDMA/qedr: Use the common mmap API Michal Kalderon
2019-08-20 12:18 ` [PATCH v7 rdma-next 6/7] RDMA/qedr: Add doorbell overflow recovery support Michal Kalderon
2019-08-20 12:18 ` [PATCH v7 rdma-next 7/7] RDMA/qedr: Add iWARP doorbell " Michal Kalderon
2019-08-20 18:31 ` [PATCH v7 rdma-next 0/7] RDMA/qedr: Use the doorbell overflow recovery mechanism for RDMA Gal Pressman
2019-08-21  8:03   ` Michal Kalderon
2019-08-21 10:15     ` Gal Pressman
2019-08-21 10:32       ` Michal Kalderon
2019-08-21 10:41         ` Gal Pressman
2019-08-21 12:25           ` Gal Pressman
2019-08-21 16:23             ` Gal Pressman
2019-08-21 16:27               ` Michal Kalderon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190820132125.GC29246@ziepe.ca \
    --to=jgg@ziepe.ca \
    --cc=aelior@marvell.com \
    --cc=ariel.elior@marvell.com \
    --cc=bmt@zurich.ibm.com \
    --cc=dledford@redhat.com \
    --cc=galpress@amazon.com \
    --cc=leon@kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=michal.kalderon@marvell.com \
    --cc=mkalderon@marvell.com \
    --cc=sleybo@amazon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.