Re: [PATCH mlx5-next 4/5] RDMA/mlx5: Change the cache structure to an rbtree

From: Jason Gunthorpe <jgg@nvidia.com>
To: Leon Romanovsky <leon@kernel.org>
Cc: Doug Ledford <dledford@redhat.com>,
	Aharon Landau <aharonl@nvidia.com>,
	Jakub Kicinski <kuba@kernel.org>,
	Jason Wang <jasowang@redhat.com>,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	"Michael S. Tsirkin" <mst@redhat.com>,
	netdev@vger.kernel.org, Saeed Mahameed <saeedm@nvidia.com>,
	Shay Drory <shayd@nvidia.com>,
	virtualization@lists.linux-foundation.org
Subject: Re: [PATCH mlx5-next 4/5] RDMA/mlx5: Change the cache structure to an rbtree
Date: Thu, 29 Jul 2021 16:45:43 -0300	[thread overview]
Message-ID: <20210729194543.GA2484190@nvidia.com> (raw)
In-Reply-To: <f4dbb2d090b2d97ac6ba3d88605069fa2e83fff8.1624362290.git.leonro@nvidia.com>

> diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> index ffb6f1d41f3d..e22eeceae9eb 100644
> +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> @@ -749,7 +749,7 @@ struct mlx5_cache_ent {
>  
>  
>  	char                    name[4];
> -	u32                     order;
> +	u32			order;
>  	u32			xlt;
>  	u32			access_mode;
>  	u32			page;

Looking at this, it looks like it will be a lot simpler to just store
the reference mkc here and use that whole blob as the rb key and write
a slightly special compare function.. Maybe only the ndescs needs to
be stored loose.

Then all the weirdness about special ents disappears, they are
naturally handled by having the required bits in their mkc.

And all the random encode/decode/recode scattered all over the place
goes away. Anyone working with mkeys needs to build a mkc on their
stack, then check if allocation of that mkc can be satisfied with the
cache, otherwise pass that same mkc to the alloc cmd. The one case
that uses the PAS will have to alloc a new mkc and memcpy, but that is
OK.

> +static struct mlx5_cache_ent *mkey_cache_ent_from_size(struct mlx5_ib_dev *dev,
> +						       int ent_flags, int size)
> +{
> +	struct rb_node *node = dev->cache.cache_root.rb_node;
> +	struct mlx5_cache_ent *cur, *prev = NULL;
> +
> +	WARN_ON(!mutex_is_locked(&dev->cache.cache_lock));

Yikes, no, use lockdep.

> @@ -616,13 +739,18 @@ struct mlx5_ib_mr *mlx5_alloc_special_mkey(struct mlx5_ib_dev *dev,
>  static struct mlx5r_cache_mkey *get_cache_mkey(struct mlx5_cache_ent *req_ent)
>  {
>  	struct mlx5_ib_dev *dev = req_ent->dev;
> -	struct mlx5_cache_ent *ent = req_ent;
>  	struct mlx5r_cache_mkey *cmkey;
> +	struct mlx5_cache_ent *ent;
> +	struct rb_node *node;
>  
>  	/* Try larger Mkey pools from the cache to satisfy the allocation */
> -	for (; ent != &dev->cache.ent[MKEY_CACHE_LAST_STD_ENTRY + 1]; ent++) {
> -		mlx5_ib_dbg(dev, "order %u, cache index %zu\n", ent->order,
> -			    ent - dev->cache.ent);
> +	mutex_lock(&dev->cache.cache_lock);
> +	for (node = &req_ent->node; node; node = rb_next(node)) {
> +		ent = container_of(node, struct mlx5_cache_ent, node);

See, this should be 'search for the mkc I have for the lowest entry with size+1'

> -int mlx5_mkey_cache_init(struct mlx5_ib_dev *dev)
> +int mlx5_mkey_cache_tree_init(struct mlx5_ib_dev *dev)
>  {
> -	struct mlx5_mkey_cache *cache = &dev->cache;
> +	struct mlx5_mkey_cache_tree *cache = &dev->cache;
>  	struct mlx5_cache_ent *ent;
> +	int err;
>  	int i;
>  
>  	mutex_init(&dev->slow_path_mutex);
> +	mutex_init(&cache->cache_lock);
> +	cache->cache_root = RB_ROOT;
>  	cache->wq = alloc_ordered_workqueue("mkey_cache", WQ_MEM_RECLAIM);
>  	if (!cache->wq) {
>  		mlx5_ib_warn(dev, "failed to create work queue\n");
> @@ -745,28 +882,25 @@ int mlx5_mkey_cache_init(struct mlx5_ib_dev *dev)
>  
>  	mlx5_cmd_init_async_ctx(dev->mdev, &dev->async_ctx);
>  	timer_setup(&dev->delay_timer, delay_time_func, 0);
> -	for (i = 0; i < MAX_MKEY_CACHE_ENTRIES; i++) {
> -		ent = &cache->ent[i];
> -		INIT_LIST_HEAD(&ent->head);
> -		spin_lock_init(&ent->lock);
> -		ent->order = i + 2;
> -		ent->dev = dev;
> -		ent->limit = 0;
> -
> -		INIT_WORK(&ent->work, cache_work_func);
> -		INIT_DELAYED_WORK(&ent->dwork, delayed_cache_work_func);
> +	for (i = 0; i < MAX_MKEY_CACHE_DEFAULT_ENTRIES; i++) {
> +		u8 order = i + 2;
> +		u32 xlt_size = (1 << order) * sizeof(struct mlx5_mtt) /
> +			       MLX5_IB_UMR_OCTOWORD;

This should be written saner

for (xlt_size = MKEY_CACHE_DEFAULT_MIN_DESCS * sizeof(struct mlx5_mtt) / MLX5_IB_UMR_OCTOWORD; 
     xlt_size <= MKEY_CACHE_DEFAULT_MAX_DESCS *  sizeof(struct mlx5_mtt) / MLX5_IB_UMR_OCTOWORD; 
     xlt_size *= 2)

>  
>  		if (i > MKEY_CACHE_LAST_STD_ENTRY) {

The index in the cache should be meaningless now, so don't put this
code here.

> -			mlx5_odp_init_mkey_cache_entry(ent);
> +			err = mlx5_odp_init_mkey_cache_entry(dev, i);
> +			if (err)
> +				return err;
>  			continue;
>  		}

> -		if (ent->order > mkey_cache_max_order(dev))
> +		ent = mlx5_ib_create_cache_ent(dev, 0, xlt_size, order);

And why do we need to pass in order, why is it stored in the
cache_ent? Looks like it should be removed

The debugfs looks like it might need some rethink as is it can only
control the original buckets, the new buckets don't get exposed. Seems
like trouble.

If just exposing the legacy things is the idea then it should have the
same sweep over the parameter space as above, not just assume that the
rb tree is in order and only contains debugfs entries.

Probably change it to create the debugfs nodes at the same time the
cache entry itself is created.

> @@ -973,14 +1100,16 @@ static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd *pd,
>  						     0, iova);
>  	if (WARN_ON(!page_size))
>  		return ERR_PTR(-EINVAL);
> -	ent = mkey_cache_ent_from_order(
> -		dev, order_base_2(ib_umem_num_dma_blocks(umem, page_size)));
> +	ent_flags = mlx5_ent_access_flags(dev, access_flags);
> +	xlt_size = get_octo_len(iova, umem->length, order_base_2(page_size));
> +	mutex_lock(&dev->cache.cache_lock);
> +	ent = mkey_cache_ent_from_size(dev, ent_flags, xlt_size);

See here is where I wonder if it is just better to build the mkc on
the stack in one place instead of having all this stuff open coded all
over..

> -void mlx5_odp_init_mkey_cache_entry(struct mlx5_cache_ent *ent)
> +int mlx5_odp_init_mkey_cache_entry(struct mlx5_ib_dev *dev, int ent_num)
>  {
> -	if (!(ent->dev->odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
> -		return;
> +	struct mlx5_cache_ent *ent;
> +	int ent_flags;
> +	u32 xlt_size;
> +
> +	if (!(dev->odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
> +		return 0;
>  
> -	switch (ent->order - 2) {
> +	switch (ent_num) {
>  	case MLX5_IMR_MTT_CACHE_ENTRY:

Don't do stuff like this either. The mkc scheme will fix this too as
this will just create two mkcs for these unique usages and store them
normally.

Jason