All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kirti Wankhede <kwankhede@nvidia.com>
To: Parav Pandit <parav@mellanox.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>
Cc: Neo Jia <cjia@nvidia.com>
Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
Date: Thu, 28 Mar 2019 22:50:38 +0530	[thread overview]
Message-ID: <69027775-7ca9-ff69-f1cc-1dc9b2001136@nvidia.com> (raw)
In-Reply-To: <VI1PR0501MB2271EB7C5A92CEA195A2FA59D15F0@VI1PR0501MB2271.eurprd05.prod.outlook.com>



On 3/26/2019 9:00 PM, Parav Pandit wrote:
> 
> 
>> -----Original Message-----
>> From: Kirti Wankhede <kwankhede@nvidia.com>
>> Sent: Tuesday, March 26, 2019 2:06 AM
>> To: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
>> kernel@vger.kernel.org; alex.williamson@redhat.com
>> Cc: Neo Jia <cjia@nvidia.com>
>> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
>>
>>
>>
>> On 3/23/2019 4:50 AM, Parav Pandit wrote:
>>> There are five problems with current code structure.
>>> 1. mdev device is placed on the mdev bus before it is created in the
>>> vendor driver. Once a device is placed on the mdev bus without
>>> creating its supporting underlying vendor device, an open() can get
>>> triggered by userspace on partially initialized device.
>>> Below ladder diagram highlight it.
>>>
>>>       cpu-0                                       cpu-1
>>>       -----                                       -----
>>>    create_store()
>>>      mdev_create_device()
>>>        device_register()
>>>           ...
>>>          vfio_mdev_probe()
>>>          ...creates char device
>>>                                         vfio_mdev_open()
>>>                                           parent->ops->open(mdev)
>>>                                             vfio_ap_mdev_open()
>>>                                               matrix_mdev = NULL
>>>         [...]
>>>         parent->ops->create()
>>>           vfio_ap_mdev_create()
>>>             mdev_set_drvdata(mdev, matrix_mdev);
>>>             /* Valid pointer set above */
>>>
>>
>> VFIO interface uses sysfs path of device or PCI device's BDF where it checks
>> sysfs file for that device exist.
>> In case of VFIO mdev device, above situation will never happen as open will
>> only get called if sysfs entry for that device exist.
>>
>> If you don't use VFIO interface then this situation can arise. In that case
>> probe() can be used for very basic initialization then create actual char
>> device from create().
>>
> I explained you that create() cannot do the heavy lifting work of creating netdev and rdma dev because at that stage driver doesn't know whether its getting used for VM or host.
> create() needs to create the device that probe() can work on in stable manner.
> 

You can identify if its getting used by VM or host from create(). Since
probe() happens first, from create() you can check
mdev_dev(mdev)->driver->name, if its 'vfio_mdev' then its getting used
by VM, otherwise used by host.

>>
>>> 2. Current creation sequence is,
>>>    parent->ops_create()
>>>    groups_register()
>>>
>>> Remove sequence is,
>>>    parent->ops->remove()
>>>    groups_unregister()
>>> However, remove sequence should be exact mirror of creation sequence.
>>> Once this is achieved, all users of the mdev will be terminated first
>>> before removing underlying vendor device.
>>> (Follow standard linux driver model).
>>> At that point vendor's remove() ops shouldn't failed because device is
>>> taken off the bus that should terminate the users.
>>>
>>
>> If VMM or user space application is using mdev device,
>> parent->ops->remove() can return failure. In that case sysfs files
>> shouldn't be removed. Hence above sequence is followed for remove.
>>
>> Standard linux driver model doesn't allow remove() to fail, but in of mdev
>> framework, interface is defined to handle such error case.
>>
> But the sequence is incorrect for wider use case.
>>
>>> 3. Additionally any new mdev driver that wants to work on mdev device
>>> during probe() routine registered using mdev_register_driver() needs
>>> to get stable mdev structure.
>>>
>>
>> Things that you are trying to handle with mdev structure from probe(),
>> couldn't that be moved to create()?
>>
> No, as explained before and above.
> That approach just doesn't look right.
>

As I mentioned abouve, you can do that.


>>
>>> 4. In following sequence, child devices created while removing mdev
>>> parent device can be left out, or it may lead to race of removing half
>>> initialized child mdev devices.
>>>
>>> issue-1:
>>> --------
>>>        cpu-0                         cpu-1
>>>        -----                         -----
>>>                                   mdev_unregister_device()
>>>                                      device_for_each_child()
>>>                                         mdev_device_remove_cb()
>>>                                             mdev_device_remove()
>>> create_store()
>>>   mdev_device_create()                   [...]
>>>        device_register()
>>>                                   parent_remove_sysfs_files()
>>>                                   /* BUG: device added by cpu-0
>>>                                    * whose parent is getting removed.
>>>                                    */
>>>
>>> issue-2:
>>> --------
>>>        cpu-0                         cpu-1
>>>        -----                         -----
>>> create_store()
>>>   mdev_device_create()                   [...]
>>>        device_register()
>>>
>>>        [...]                      mdev_unregister_device()
>>>                                      device_for_each_child()
>>>                                         mdev_device_remove_cb()
>>>                                             mdev_device_remove()
>>>
>>>        mdev_create_sysfs_files()
>>>        /* BUG: create is adding
>>>         * sysfs files for a device
>>>         * which is undergoing removal.
>>>         */
>>>                                  parent_remove_sysfs_files()
>>>
>>> 5. Below crash is observed when user initiated remove is in progress
>>> and mdev_unregister_driver() completes parent unregistration.
>>>
>>>        cpu-0                         cpu-1
>>>        -----                         -----
>>> remove_store()
>>>    mdev_device_remove()
>>>    active = false;
>>>                                   mdev_unregister_device()
>>>                                     remove type
>>>    [...]
>>>    mdev_remove_ops() crashes.
>>>
>>> This is similar race like create() racing with mdev_unregister_device().
>>>
>>> mtty mtty: MDEV: Registered
>>> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
>>> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
>>> mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
>>> mtty_dev: Unloaded!
>>> BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
>>> af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
>>> Oops: 0000 [#1] SMP PTI
>>> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
>>> 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
>>> SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
>>> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
>>>  mdev_device_remove+0xef/0x130 [mdev]
>>>  remove_store+0x77/0xa0 [mdev]
>>>  kernfs_fop_write+0x113/0x1a0
>>>  __vfs_write+0x33/0x1b0
>>>  ? rcu_read_lock_sched_held+0x64/0x70
>>>  ? rcu_sync_lockdep_assert+0x2a/0x50
>>>  ? __sb_start_write+0x121/0x1b0
>>>  ? vfs_write+0x17c/0x1b0
>>>  vfs_write+0xad/0x1b0
>>>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>>>  ksys_write+0x55/0xc0
>>>  do_syscall_64+0x5a/0x210
>>>
>>> Therefore, mdev core is improved in following ways to overcome above
>>> issues.
>>>
>>> 1. Before placing mdev devices on the bus, perform vendor drivers
>>> creation which supports the mdev creation.
>>> This ensures that mdev specific all necessary fields are initialized
>>> before a given mdev can be accessed by bus driver.
>>>
>>> 2. During remove flow, first remove the device from the bus. This
>>> ensures that any bus specific devices and data is cleared.
>>> Once device is taken of the mdev bus, perform remove() of mdev from
>>> the vendor driver.
>>>
>>
>> If user space application is using the device and someone underneath
>> remove the device from bus, how would use space application know that
>> device is being removed?
> vfio_mdev guards and wait for device to get closed.
> 
> One sample trace is below.
> [<0>] vfio_del_group_dev+0x34a/0x3c0 [vfio]
> [<0>] mdev_remove+0x21/0x40 [mdev]
> [<0>] device_release_driver_internal+0xe8/0x1b0
> [<0>] bus_remove_device+0xf9/0x170
> [<0>] device_del+0x168/0x350
> [<0>] mdev_device_remove_common+0x1e/0x60 [mdev]
> [<0>] mdev_device_remove_cb+0x1a/0x30 [mdev]
> [<0>] device_for_each_child+0x47/0x90
> [<0>] mdev_unregister_device+0xdb/0x100 [mdev]
> [<0>] mtty_dev_exit+0x17/0x843 [mtty]
> [<0>] __x64_sys_delete_module+0x16b/0x240
> [<0>] do_syscall_64+0x5a/0x210
> [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [<0>] 0xffffffffffffffff
> 
>> If DMA is setup, user space application is accessing that memory and device
>> is removed from bus - how will you restrict to not to remove that device? If
>> remove() is not restricted then host might crash.
>> I know Linux kernel device core model doesn't allow remove() to fail, but we
>> had tackled that problem for mdev devices in this framework. I prefer not to
>> change this behavior. This will regress existing working drivers.
>>
> vfio layer ensures that open device cannot be removed from above trace.
> 
> Other drivers will follow similar method. In case of mlx5 driver which binds
> to mdev follows standard driver model to terminate for this mdev device,
> similar way for pci device.
> 

But then remove() or write on 'remove' sysfs would block, which could be
indefinite. For example in case of VM, it will block until VM is not
shutdown.
With current approach, write on 'remove' sysfs doesn't block.

Thanks,
Kirti

>>
>>> 3. Linux core device model provides way to register and auto
>>> unregister the device sysfs attribute groups at dev->groups.
>>> Make use of this groups to let core create the groups and simplify
>>> code to avoid explicit groups creation and removal.
>>>
>>> 4. Wait for any ongoing mdev create() and remove() to finish before
>>> unregistering parent device using srcu. This continues to allow
>>> multiple create and remove to progress in parallel. At the same time
>>> guard parent removal while parent is being access by create() and remove
>> callbacks.
>>>
>>
>> Agreed with this.
>> Alex already mentioned, it would be better to have separate patch for this
>> fix.
>>
> Patches are ready, I am waiting for above discussion to close before posting v1.
> 

  reply	other threads:[~2019-03-28 17:20 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-22 23:20 [PATCH 0/8] vfio/mdev: Improve vfio/mdev core module Parav Pandit
2019-03-22 23:20 ` [PATCH 1/8] vfio/mdev: Fix to not do put_device on device_register failure Parav Pandit
2019-03-25 11:48   ` Maxim Levitsky
2019-03-25 18:17   ` Kirti Wankhede
2019-03-25 19:21     ` Alex Williamson
2019-03-25 21:11       ` Parav Pandit
2019-03-22 23:20 ` [PATCH 2/8] vfio/mdev: Avoid release parent reference during error path Parav Pandit
2019-03-25 11:49   ` Maxim Levitsky
2019-03-25 18:27   ` Kirti Wankhede
2019-03-22 23:20 ` [PATCH 3/8] vfio/mdev: Removed unused kref Parav Pandit
2019-03-25 11:50   ` Maxim Levitsky
2019-03-25 18:41   ` Kirti Wankhede
2019-03-22 23:20 ` [PATCH 4/8] vfio/mdev: Drop redundant extern for exported symbols Parav Pandit
2019-03-25 11:56   ` Maxim Levitsky
2019-03-25 19:07   ` Kirti Wankhede
2019-03-25 19:49     ` Alex Williamson
2019-03-25 21:27       ` Parav Pandit
2019-03-22 23:20 ` [PATCH 5/8] vfio/mdev: Avoid masking error code to EBUSY Parav Pandit
2019-03-25 11:57   ` Maxim Levitsky
2019-03-25 19:18   ` Kirti Wankhede
2019-03-25 21:29     ` Parav Pandit
2019-03-22 23:20 ` [PATCH 6/8] vfio/mdev: Follow correct remove sequence Parav Pandit
2019-03-25 11:58   ` Maxim Levitsky
2019-03-25 20:20   ` Alex Williamson
2019-03-25 21:31     ` Parav Pandit
2019-03-22 23:20 ` [PATCH 7/8] vfio/mdev: Fix aborting mdev child device removal if one fails Parav Pandit
2019-03-25 11:58   ` Maxim Levitsky
2019-03-25 19:35   ` Kirti Wankhede
2019-03-25 20:49     ` Alex Williamson
2019-03-25 21:36       ` Parav Pandit
2019-03-25 21:52         ` Alex Williamson
2019-03-25 22:07           ` Parav Pandit
2019-03-22 23:20 ` [PATCH 8/8] vfio/mdev: Improve the create/remove sequence Parav Pandit
2019-03-25 13:24   ` Maxim Levitsky
2019-03-25 21:42     ` Parav Pandit
2019-03-25 23:18   ` Alex Williamson
2019-03-25 23:34     ` Parav Pandit
2019-03-26  0:05       ` Alex Williamson
2019-03-26  1:43         ` Parav Pandit
2019-03-26  2:16           ` Alex Williamson
2019-03-26  3:19             ` Parav Pandit
2019-03-26  5:53               ` Parav Pandit
2019-03-26 15:21                 ` Alex Williamson
2019-03-26  7:06   ` Kirti Wankhede
2019-03-26 15:26     ` Alex Williamson
2019-03-27  3:19       ` Parav Pandit
2019-03-27  3:19         ` Parav Pandit
2019-03-26 15:30     ` Parav Pandit
2019-03-28 17:20       ` Kirti Wankhede [this message]
2019-03-29 14:49         ` Alex Williamson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=69027775-7ca9-ff69-f1cc-1dc9b2001136@nvidia.com \
    --to=kwankhede@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=cjia@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=parav@mellanox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.