All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kirti Wankhede <kwankhede@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>, <kvm@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>,
	Dong Jia Shi <bjsdjshi@linux.ibm.com>,
	"Halil Pasic" <pasic@linux.ibm.com>
Subject: Re: [PATCH v3] vfio/mdev: Check globally for duplicate devices
Date: Fri, 18 May 2018 01:56:50 +0530	[thread overview]
Message-ID: <3ab0fe44-9290-8c29-5e4c-2599ba0de373@nvidia.com> (raw)
In-Reply-To: <20180517102000.50656267@w520.home>



On 5/17/2018 9:50 PM, Alex Williamson wrote:
> On Thu, 17 May 2018 21:25:22 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 5/17/2018 1:39 PM, Cornelia Huck wrote:
>>> On Wed, 16 May 2018 21:30:19 -0600
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>   
>>>> When we create an mdev device, we check for duplicates against the
>>>> parent device and return -EEXIST if found, but the mdev device
>>>> namespace is global since we'll link all devices from the bus.  We do
>>>> catch this later in sysfs_do_create_link_sd() to return -EEXIST, but
>>>> with it comes a kernel warning and stack trace for trying to create
>>>> duplicate sysfs links, which makes it an undesirable response.
>>>>
>>>> Therefore we should really be looking for duplicates across all mdev
>>>> parent devices, or as implemented here, against our mdev device list.
>>>> Using mdev_list to prevent duplicates means that we can remove
>>>> mdev_parent.lock, but in order not to serialize mdev device creation
>>>> and removal globally, we add mdev_device.active which allows UUIDs to
>>>> be reserved such that we can drop the mdev_list_lock before the mdev
>>>> device is fully in place.
>>>>
>>>> NB. there was never intended to be any serialization guarantee
>>>> provided by the mdev core with respect to creation and removal of mdev
>>>> devices, mdev_parent.lock provided this only as a side-effect of the
>>>> implementation for locking the namespace per parent.  That
>>>> serialization is now removed.  
>>>   
>>
>> mdev_parent.lock is to serialize create and remove of that mdev device,
>> that handles race condition that Cornelia mentioned below.
> 
> Previously it was stated:
> 
> On Thu, 17 May 2018 01:01:40 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>> Here lock is not for create/remove routines of vendor driver, its about
>> mdev device creation and device registration, which is a common code
>> path, and so is part of mdev core module.
> 
> So the race condition was handled previously, but as a side-effect of
> protecting the namespace, aiui.  I'm trying to state above that the
> serialization of create/remove was never intended as a guarantee
> provided to mdev vendor drivers.  I don't see that there's a need to
> protect "mdev device creation and device registration" beyond conflicts
> in the UUID namespace, which is done here.  Are there others?
> 

Sorry not being elaborative in my earlier response to

> If we can
> show that vendor drivers handle the create/remove paths themselves,
> perhaps we can refine the locking granularity.

mdev_device_create() function does :
- create mdev device
- register device
- call vendor driver->create
- create sysfs files.

mdev_device_remove() removes sysfs files, unregister device and delete
device.

There is common code in mdev_device_create() and mdev_device_remove()
independent of what vendor driver does in its create()/remove()
callback. Moving this code to each vendor driver to handle create/remove
themselves doesn't make sense to me.

mdev_parent.lock here does take care of race conditions that could occur
during mdev device creation and deletion in this common code path.

What is the urge to remove mdev_parent.lock if that handles all race
conditions without bothering user to handle -EAGAIN?

Thanks,
Kirti


>>> This is probably fine; but I noted that documentation on the locking
>>> conventions and serialization guarantees for mdev is a bit sparse, and
>>> this topic also came up during the vfio-ap review.
>>>
>>> We probably want to add some more concrete documentation; would the
>>> kernel doc for the _ops or vfio-mediated-device.txt be a better place
>>> for that?
> 
> I'll look to see where we can add a note withing that file, I suspect
> that's the right place to put it.
> 
>>> [Dong Jia, Halil: Can you please take a look whether vfio-ccw is really
>>> ok? I don't think we open up any new races, but I'd appreciate a second
>>> or third opinion.]
>>>   
>>>>
>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>> ---
>>>>
>>>> v3: Rework locking and add a field to mdev_device so we can track
>>>>     completed instances vs those added to reserve the namespace.
>>>>
>>>>  drivers/vfio/mdev/mdev_core.c    |   94 +++++++++++++-------------------------
>>>>  drivers/vfio/mdev/mdev_private.h |    2 -
>>>>  2 files changed, 34 insertions(+), 62 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
>>>> index 126991046eb7..55ea9d34ec69 100644
>>>> --- a/drivers/vfio/mdev/mdev_core.c
>>>> +++ b/drivers/vfio/mdev/mdev_core.c
>>>> @@ -66,34 +66,6 @@ uuid_le mdev_uuid(struct mdev_device *mdev)
>>>>  }
>>>>  EXPORT_SYMBOL(mdev_uuid);
>>>>  
>>>> -static int _find_mdev_device(struct device *dev, void *data)
>>>> -{
>>>> -	struct mdev_device *mdev;
>>>> -
>>>> -	if (!dev_is_mdev(dev))
>>>> -		return 0;
>>>> -
>>>> -	mdev = to_mdev_device(dev);
>>>> -
>>>> -	if (uuid_le_cmp(mdev->uuid, *(uuid_le *)data) == 0)
>>>> -		return 1;
>>>> -
>>>> -	return 0;
>>>> -}
>>>> -
>>>> -static bool mdev_device_exist(struct mdev_parent *parent, uuid_le uuid)
>>>> -{
>>>> -	struct device *dev;
>>>> -
>>>> -	dev = device_find_child(parent->dev, &uuid, _find_mdev_device);
>>>> -	if (dev) {
>>>> -		put_device(dev);
>>>> -		return true;
>>>> -	}
>>>> -
>>>> -	return false;
>>>> -}
>>>> -
>>>>  /* Should be called holding parent_list_lock */
>>>>  static struct mdev_parent *__find_parent_device(struct device *dev)
>>>>  {
>>>> @@ -221,7 +193,6 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>>>>  	}
>>>>  
>>>>  	kref_init(&parent->ref);
>>>> -	mutex_init(&parent->lock);
>>>>  
>>>>  	parent->dev = dev;
>>>>  	parent->ops = ops;
>>>> @@ -304,7 +275,7 @@ static void mdev_device_release(struct device *dev)
>>>>  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>>>>  {
>>>>  	int ret;
>>>> -	struct mdev_device *mdev;
>>>> +	struct mdev_device *mdev, *tmp;
>>>>  	struct mdev_parent *parent;
>>>>  	struct mdev_type *type = to_mdev_type(kobj);
>>>>  
>>>> @@ -312,21 +283,26 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>>>>  	if (!parent)
>>>>  		return -EINVAL;
>>>>  
>>>> -	mutex_lock(&parent->lock);
>>>> +	mutex_lock(&mdev_list_lock);
>>>>  
>>>>  	/* Check for duplicate */
>>>> -	if (mdev_device_exist(parent, uuid)) {
>>>> -		ret = -EEXIST;
>>>> -		goto create_err;
>>>> +	list_for_each_entry(tmp, &mdev_list, next) {
>>>> +		if (!uuid_le_cmp(tmp->uuid, uuid)) {
>>>> +			mutex_unlock(&mdev_list_lock);
>>>> +			return -EEXIST;
>>>> +		}
>>>>  	}
>>>>    
>>
>> mdev_put_parent(parent) missing before return.
>>
>>
>>>>  	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
>>>>  	if (!mdev) {
>>>> -		ret = -ENOMEM;
>>>> -		goto create_err;
>>>> +		mutex_unlock(&mdev_list_lock);
>>>> +		return -ENOMEM;
>>>>  	}
>>>>  
>>
>> mdev_put_parent(parent) missing here again.
> 
> 
> Oops, will fix both.
> 
> 
>>>>  	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
>>>> +	list_add(&mdev->next, &mdev_list);
>>>> +	mutex_unlock(&mdev_list_lock);
>>>> +
>>>>  	mdev->parent = parent;
>>>>  	kref_init(&mdev->ref);
>>>>  
>>>> @@ -352,21 +328,18 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>>>>  	}
>>>>  
>>>>  	mdev->type_kobj = kobj;
>>>> +	mdev->active = true;
>>>>  	dev_dbg(&mdev->dev, "MDEV: created\n");
>>>>  
>>>> -	mutex_unlock(&parent->lock);
>>>> -
>>>> -	mutex_lock(&mdev_list_lock);
>>>> -	list_add(&mdev->next, &mdev_list);
>>>> -	mutex_unlock(&mdev_list_lock);
>>>> -
>>>> -	return ret;
>>>> +	return 0;
>>>>  
>>>>  create_failed:
>>>>  	device_unregister(&mdev->dev);
>>>>  
>>>>  create_err:
>>>> -	mutex_unlock(&parent->lock);
>>>> +	mutex_lock(&mdev_list_lock);
>>>> +	list_del(&mdev->next);
>>>> +	mutex_unlock(&mdev_list_lock);
>>>>  	mdev_put_parent(parent);
>>>>  	return ret;
>>>>  }
>>>> @@ -377,44 +350,43 @@ int mdev_device_remove(struct device *dev, bool force_remove)
>>>>  	struct mdev_parent *parent;
>>>>  	struct mdev_type *type;
>>>>  	int ret;
>>>> -	bool found = false;
>>>>  
>>>>  	mdev = to_mdev_device(dev);
>>>>  
>>>>  	mutex_lock(&mdev_list_lock);
>>>>  	list_for_each_entry(tmp, &mdev_list, next) {
>>>> -		if (tmp == mdev) {
>>>> -			found = true;
>>>> +		if (tmp == mdev)
>>>>  			break;
>>>> -		}
>>>>  	}
>>>>  
>>>> -	if (found)
>>>> -		list_del(&mdev->next);
>>>> +	if (tmp != mdev) {
>>>> +		mutex_unlock(&mdev_list_lock);
>>>> +		return -ENODEV;
>>>> +	}
>>>>  
>>>> -	mutex_unlock(&mdev_list_lock);
>>>> +	if (!mdev->active) {
>>>> +		mutex_unlock(&mdev_list_lock);
>>>> +		return -EAGAIN;
>>>> +	}  
>>>
>>> I'm not sure whether this is 100% watertight. Consider:
>>>
>>> - device gets registered, we have added it to the list, made it visible
>>>   in sysfs and have added the remove attribute, but not yet the symlinks
>>> - userspace can access the remove attribute and trigger removal
>>> - we do an early exit here because not yet active
>>> - ???
>>>
>>> (If there's any problem, it's a very pathological case, and I don't
>>> think anything really bad can happen. I just want to make sure we don't
>>> miss anything.)
> 
> The presented scenario is exactly the use case that the -EAGAIN return
> is intended to handle.  I can't put a mutex on the mdev_device to block
> this path as the mdev creation might ultimately fail and the device
> released (perhaps not possible in our code flow, but that would be a
> very subtle detail to rely on).  So any sort of blocking approach would
> require releasing mdev_list_lock and re-scanning in a busy loop.  Why
> bother to do that when we can indicate the same to the user through
> -EAGAIN.  AIUI, this is the purpose of -EAGAIN and it's up to userspace
> to decide how they'd like to handle it, try again or abort.  Are there
> suggestions for alternatives?  Thanks,
> 

  reply	other threads:[~2018-05-17 20:27 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-17  3:30 [PATCH v3] vfio/mdev: Check globally for duplicate devices Alex Williamson
2018-05-17  8:09 ` Cornelia Huck
2018-05-17 15:55   ` Kirti Wankhede
2018-05-17 16:20     ` Alex Williamson
2018-05-17 20:26       ` Kirti Wankhede [this message]
2018-05-17 21:37         ` Alex Williamson
2018-05-17 22:17           ` Alex Williamson
2018-05-18  7:04           ` Kirti Wankhede
2018-05-18 14:05             ` Cornelia Huck
2018-05-18 17:30             ` Alex Williamson
2018-05-18 19:03               ` Kirti Wankhede
2018-05-18 13:28   ` Halil Pasic

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3ab0fe44-9290-8c29-5e4c-2599ba0de373@nvidia.com \
    --to=kwankhede@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=bjsdjshi@linux.ibm.com \
    --cc=cohuck@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pasic@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.