* [PATCH net v2] virtio-net: fix possible dim status unrecoverable
@ 2024-03-26  6:25 Heng Qi
  2024-03-28 10:34 ` Paolo Abeni
  0 siblings, 1 reply; 3+ messages in thread
From: Heng Qi @ 2024-03-26  6:25 UTC (permalink / raw)
  To: netdev, virtualization
  Cc: Jason Wang, Michael S. Tsirkin, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, David S. Miller, Xuan Zhuo

When the dim worker is scheduled, if it fails to acquire the lock,
dim may not be able to return to the working state later.

For example, the following single queue scenario:
  1. The dim worker of rxq0 is scheduled, and the dim status is
     changed to DIM_APPLY_NEW_PROFILE;
  2. The ethtool command is holding rtnl lock;
  3. Since the rtnl lock is already held, virtnet_rx_dim_work fails
     to acquire the lock and exits;

Then, even if net_dim is invoked again, it cannot work because the
state is not restored to DIM_START_MEASURE.

The patch has been tested on a VM with 16 NICs, 128 queues per NIC
(2048 queues total):
With dim enabled on all queues, there are many opportunities for
contention for RTNL lock, and this patch introduces no visible hotspots.
The dim performance is also stable.

Fixes: 6208799553a8 ("virtio-net: support rx netdim")
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
v1->v2:
  - Update commit log. No functional changes.

 drivers/net/virtio_net.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d111..0ebe322 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3563,8 +3563,10 @@ static void virtnet_rx_dim_work(struct work_struct *work)
 	struct dim_cq_moder update_moder;
 	int i, qnum, err;
 
-	if (!rtnl_trylock())
+	if (!rtnl_trylock()) {
+		schedule_work(&dim->work);
 		return;
+	}
 
 	/* Each rxq's work is queued by "net_dim()->schedule_work()"
 	 * in response to NAPI traffic changes. Note that dim->profile_ix
-- 
1.8.3.1
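
The stuck state described in the commit log can be reproduced in a small userspace model. This is only a sketch, not the driver code: the state names are taken from the commit log, while struct dim_model, model_net_dim(), model_dim_work() and the lock_available / requeue_on_contention flags are hypothetical names invented for illustration. The model assumes that net_dim does nothing while the state is DIM_APPLY_NEW_PROFILE and that only the worker restores DIM_START_MEASURE, which is the behaviour the commit log describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; state names follow the commit log. */
enum dim_state {
	DIM_START_MEASURE,
	DIM_MEASURE_IN_PROGRESS,
	DIM_APPLY_NEW_PROFILE,
};

/* Hypothetical stand-in for the per-queue dim context. */
struct dim_model {
	enum dim_state state;
	bool work_pending;	/* models a queued work item */
	int applied;		/* profiles that reached the device */
};

/* Models net_dim(); `decide` (hypothetical) is true when a new profile is chosen. */
static void model_net_dim(struct dim_model *dim, bool decide)
{
	assert(dim);
	switch (dim->state) {
	case DIM_START_MEASURE:
		dim->state = DIM_MEASURE_IN_PROGRESS;
		break;
	case DIM_MEASURE_IN_PROGRESS:
		if (decide) {
			dim->state = DIM_APPLY_NEW_PROFILE;
			dim->work_pending = true;	/* models schedule_work() */
		}
		break;
	case DIM_APPLY_NEW_PROFILE:
		/* Waiting for the worker: nothing is measured or scheduled. */
		break;
	}
}

/* Models the worker. `lock_available` stands for the result of the trylock;
 * `requeue_on_contention` selects the behaviour added by this patch. */
static void model_dim_work(struct dim_model *dim, bool lock_available,
			   bool requeue_on_contention)
{
	assert(dim);
	dim->work_pending = false;
	if (!lock_available) {
		if (requeue_on_contention)
			dim->work_pending = true;	/* try again later */
		return;	/* state stays DIM_APPLY_NEW_PROFILE */
	}
	dim->applied++;
	dim->state = DIM_START_MEASURE;	/* only the worker restores this */
}
```

Without requeueing, one failed trylock leaves the model in DIM_APPLY_NEW_PROFILE with no work pending, so later model_net_dim() calls never schedule the worker again. With requeue_on_contention set, the work item stays pending and the state is restored once the lock becomes available.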



* Re: [PATCH net v2] virtio-net: fix possible dim status unrecoverable
  2024-03-26  6:25 [PATCH net v2] virtio-net: fix possible dim status unrecoverable Heng Qi
@ 2024-03-28 10:34 ` Paolo Abeni
  2024-03-29  2:19   ` Heng Qi
  0 siblings, 1 reply; 3+ messages in thread
From: Paolo Abeni @ 2024-03-28 10:34 UTC (permalink / raw)
  To: Heng Qi, netdev, virtualization
  Cc: Jason Wang, Michael S. Tsirkin, Jakub Kicinski, Eric Dumazet,
	David S. Miller, Xuan Zhuo

On Tue, 2024-03-26 at 14:25 +0800, Heng Qi wrote:
> When the dim worker is scheduled, if it fails to acquire the lock,
> dim may not be able to return to the working state later.
> 
> For example, the following single queue scenario:
>   1. The dim worker of rxq0 is scheduled, and the dim status is
>      changed to DIM_APPLY_NEW_PROFILE;
>   2. The ethtool command is holding rtnl lock;
>   3. Since the rtnl lock is already held, virtnet_rx_dim_work fails
>      to acquire the lock and exits;
> 
> Then, even if net_dim is invoked again, it cannot work because the
> state is not restored to DIM_START_MEASURE.
> 
> The patch has been tested on a VM with 16 NICs, 128 queues per NIC
> (2048 queues total):
> With dim enabled on all queues, there are many opportunities for
> contention for RTNL lock, and this patch introduces no visible hotspots.
> The dim performance is also stable.
> 
> Fixes: 6208799553a8 ("virtio-net: support rx netdim")
> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
> Acked-by: Jason Wang <jasowang@redhat.com>
> ---
> v1->v2:
>   - Update commit log. No functional changes.
> 
>  drivers/net/virtio_net.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c22d111..0ebe322 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3563,8 +3563,10 @@ static void virtnet_rx_dim_work(struct work_struct *work)
>  	struct dim_cq_moder update_moder;
>  	int i, qnum, err;
>  
> -	if (!rtnl_trylock())
> +	if (!rtnl_trylock()) {
> +		schedule_work(&dim->work);
>  		return;

I'm really scared by this change. VMs are (increasingly) used to run
container orchestration, which in turn puts a lot of pressure on the
RTNL lock. Any rtnl_trylock() + reschedule may hang for a very long time.
Addressing this kind of issue later becomes _extremely_ painful, see:

https://lore.kernel.org/netdev/20231018154804.420823-1-atenart@kernel.org/

I really think a different solution is needed. What about moving
virtnet_send_command() under protection of a new mutex?

I understand it will complicate future hardening work around cvq, but
really rtnl_trylock()/<spin/retry> is bad for the whole system.

Cheers,

Paolo
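
The cost of the trylock + reschedule pattern can be illustrated with a toy model. This is a sketch under stated assumptions, not a measurement: it assumes the contended lock is held elsewhere for busy_ticks ticks and that a requeued worker runs once per tick (a real workqueue may rerun it sooner, which would be worse). Both function names are hypothetical.

```c
#include <assert.h>

/* Toy model: count how often the worker body executes before it finally
 * gets the lock, when every failed trylock requeues the work item.
 * Assumes one rerun per tick while the lock is held elsewhere. */
static int worker_runs_trylock_requeue(int busy_ticks)
{
	int runs = 0;
	int tick = 0;

	assert(busy_ticks >= 0);
	for (;;) {
		runs++;
		if (tick >= busy_ticks)	/* trylock succeeds */
			return runs;
		tick++;			/* trylock failed: requeue and rerun */
	}
}

/* Toy model: with a dedicated lock that is independent of the contended
 * one, the worker body runs once regardless of busy_ticks. */
static int worker_runs_dedicated_lock(int busy_ticks)
{
	assert(busy_ticks >= 0);
	(void)busy_ticks;	/* hold time of the other lock is irrelevant */
	return 1;
}
```

In this model the number of wasted worker executions grows linearly with how long the contended lock is held, and that cost is paid per queue, while the dedicated-lock path is unaffected by it.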



* Re: [PATCH net v2] virtio-net: fix possible dim status unrecoverable
  2024-03-28 10:34 ` Paolo Abeni
@ 2024-03-29  2:19   ` Heng Qi
  0 siblings, 0 replies; 3+ messages in thread
From: Heng Qi @ 2024-03-29  2:19 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Jason Wang, Michael S. Tsirkin, Jakub Kicinski, Eric Dumazet,
	David S. Miller, Xuan Zhuo, Daniel Jurgens,
	open list:NETWORKING DRIVERS,
	open list:VIRTIO CORE AND NET DRIVERS



On 2024/3/28 at 6:34 PM, Paolo Abeni wrote:
> On Tue, 2024-03-26 at 14:25 +0800, Heng Qi wrote:
>> When the dim worker is scheduled, if it fails to acquire the lock,
>> dim may not be able to return to the working state later.
>>
>> For example, the following single queue scenario:
>>    1. The dim worker of rxq0 is scheduled, and the dim status is
>>       changed to DIM_APPLY_NEW_PROFILE;
>>    2. The ethtool command is holding rtnl lock;
>>    3. Since the rtnl lock is already held, virtnet_rx_dim_work fails
>>       to acquire the lock and exits;
>>
>> Then, even if net_dim is invoked again, it cannot work because the
>> state is not restored to DIM_START_MEASURE.
>>
>> The patch has been tested on a VM with 16 NICs, 128 queues per NIC
>> (2048 queues total):
>> With dim enabled on all queues, there are many opportunities for
>> contention for RTNL lock, and this patch introduces no visible hotspots.
>> The dim performance is also stable.
>>
>> Fixes: 6208799553a8 ("virtio-net: support rx netdim")
>> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
>> Acked-by: Jason Wang <jasowang@redhat.com>
>> ---
>> v1->v2:
>>    - Update commit log. No functional changes.
>>
>>   drivers/net/virtio_net.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index c22d111..0ebe322 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -3563,8 +3563,10 @@ static void virtnet_rx_dim_work(struct work_struct *work)
>>   	struct dim_cq_moder update_moder;
>>   	int i, qnum, err;
>>   
>> -	if (!rtnl_trylock())
>> +	if (!rtnl_trylock()) {
>> +		schedule_work(&dim->work);
>>   		return;
> I'm really scared by this change. VMs are (increasingly) used to run
> container orchestration, which in turn puts a lot of pressure on the
> RTNL lock. Any rtnl_trylock() + reschedule may hang for a very long time.
> Addressing this kind of issue later becomes _extremely_ painful, see:
>
> https://lore.kernel.org/netdev/20231018154804.420823-1-atenart@kernel.org/
>
> I really think a different solution is needed. What about moving
> virtnet_send_command() under protection of a new mutex?

Daniel did additional work:

https://lore.kernel.org/all/20240328044715.266641-1-danielj@nvidia.com/

It uses a spin lock to protect ctrlq access, so the rtnl lock can be
removed from rx_dim_work, which makes this problem go away.

Thanks,
Heng

>
> I understand it will complicate future hardening work around cvq, but
> really rtnl_trylock()/<spin/retry> is bad for the whole system.
>
> Cheers,
>
> Paolo
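
The direction mentioned in the reply above (a dedicated lock for ctrlq access, so the worker no longer needs the rtnl lock) can be sketched in userspace as follows. This is not Daniel's series and not driver code: struct ctrlq_model, ctrlq_lock(), ctrlq_unlock() and model_rx_dim_work() are hypothetical names, and the spin lock is a plain C11 atomic_flag standing in for a kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>

/* State names follow the commit log; everything else is a model. */
enum dim_state {
	DIM_START_MEASURE,
	DIM_MEASURE_IN_PROGRESS,
	DIM_APPLY_NEW_PROFILE,
};

/* Hypothetical control queue guarded by its own lock. */
struct ctrlq_model {
	atomic_flag lock;	/* dedicated lock, independent of any global one */
	int commands_sent;
};

static void ctrlq_lock(struct ctrlq_model *cq)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&cq->lock,
						 memory_order_acquire))
		;
}

static void ctrlq_unlock(struct ctrlq_model *cq)
{
	atomic_flag_clear_explicit(&cq->lock, memory_order_release);
}

/* Models the worker once the global lock is out of the picture: there is
 * no early-exit path, so the state is always restored. */
static void model_rx_dim_work(struct ctrlq_model *cq, enum dim_state *state)
{
	assert(cq && state);
	ctrlq_lock(cq);
	cq->commands_sent++;	/* models sending the moderation command */
	ctrlq_unlock(cq);
	*state = DIM_START_MEASURE;
}
```

Because the worker in this sketch only takes the control-queue lock, contention on an unrelated global lock cannot make it bail out, and the state always returns to DIM_START_MEASURE.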


