kvm.vger.kernel.org archive mirror
* Re: TCP/IP connections sometimes stop retransmitting packets (in nested virtualization case)
       [not found] <1054a24529be44e11d65e61d8760f7c59dfa073b.camel@redhat.com>
@ 2021-10-18 18:05 ` Eric Dumazet
  2021-10-18 20:49   ` Michael S. Tsirkin
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Dumazet @ 2021-10-18 18:05 UTC (permalink / raw)
  To: Maxim Levitsky, netdev
  Cc: J. Bruce Fields, kvm, qemu-devel, Paolo Bonzini, Michael Tsirkin,
	David Gilbert



On 10/17/21 3:50 AM, Maxim Levitsky wrote:
> Hi!
>  
> This is a follow-up to my mail about the NFS client deadlock I was trying to debug last week:
> https://lore.kernel.org/all/e10b46b04fe4427fa50901dda71fb5f5a26af33e.camel@redhat.com/T/#u
>  
> I now strongly believe that this is not related to NFS, but rather to some issue in the networking stack, and maybe
> to the somewhat non-standard .config I was using for the kernels, which has many advanced networking options disabled
> (to cut down on compile time).
> This is why I chose to start a new thread about it.
>  
> Regarding the custom .config file: in particular, I disabled CONFIG_NET_SCHED and CONFIG_TCP_CONG_ADVANCED.
> Both the host and the fedora32 VM run the same kernel with those options disabled.
> 
> 
> My setup is a VM (fedora32) which runs a Win10 Hyper-V VM nested inside it, which in turn runs a fedora32 VM
> (but I was able to reproduce it with an ordinary VM, with Hyper-V disabled, running in the same fedora32 VM).
>  
> The host runs an NFS server, and the fedora32 VM runs an NFS client which is used to read/write a qcow2 file
> which contains the disk of the nested Win10 VM. The L3 VM which the Windows VM optionally
> runs is contained in the same qcow2 file.
> 
> 
> I managed to capture (using wireshark) packets around the failure in both L0 and L1.
> The trace shows a fair number of lost packets, a bit more than I would expect from communication running on the same host,
> but they are retransmitted and don't cause any issues until the moment of failure.
> 
> 
> The failure happens when one packet sent from the host to the guest
> is not received by the guest (as evident from the L1 trace, and from the subsequent SACKs from the guest which exclude this packet),
> and the host (on which the NFS server runs) never attempts to retransmit it.
> 
> 
> The host keeps sending further TCP packets with replies to previous RPC calls it received from the fedora32 VM,
> with increasing sequence numbers, as evident from both traces, and the fedora32 VM keeps SACK'ing those received packets,
> patiently waiting for the retransmission.
>  
> After around 12 minutes (!), the host RSTs the connection.
> 
> It is worth mentioning that while all of this is happening, the fedora32 VM can hang if one attempts to access files
> on the NFS share, because effectively all NFS communication is blocked at the TCP level.
> 
> I attached an extract from the two traces (in L0 and L1) around the failure, up to the RST packet.
> 
> In this trace, the second packet with TCP sequence number 1736557331 (the first one was empty, with no data) is not received by the guest
> and is never retransmitted by the host.
> 
> Also worth noting: to save storage, I captured only 512 bytes of each packet, but wireshark
> records how many bytes were in the actual packet.
>  
> Best regards,
> 	Maxim Levitsky

TCP has special logic to not attempt a retransmit if it senses that the prior
packet has not been consumed yet.

Usually, the consuming is done by NIC drivers at TX completion time,
when the NIC signals that the packet has been sent to the wire.

It seems one skb is essentially leaked somewhere (never freed).
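
To make that concrete, here is a standalone toy model of the check (this is
not the kernel source; the actual logic lives in skb_still_in_host_queue()
in net/ipv4/tcp_output.c, and the names below are illustrative):

/* Toy model: a retransmit attempt is skipped while the copy of the skb
 * handed to the device has not yet been freed at TX-completion time.
 * If that completion never happens (the skb is leaked), the segment is
 * never retransmitted -- matching the stuck connection in the traces. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_skb {
    unsigned int seq;
    bool clone_in_flight;   /* set when handed to the driver,
                             * cleared at TX-completion time   */
};

static int try_retransmit(struct fake_skb *skb)
{
    if (skb->clone_in_flight)
        return -EBUSY;              /* wait for the driver to free it */
    printf("retransmitting seq %u\n", skb->seq);
    skb->clone_in_flight = true;    /* handed to the (fake) driver again */
    return 0;
}

int main(void)
{
    struct fake_skb skb = { .seq = 1736557331, .clone_in_flight = true };
    int i;

    /* Completion never arrives, so every attempt bails out with -EBUSY. */
    for (i = 0; i < 3; i++)
        printf("attempt %d -> %d\n", i, try_retransmit(&skb));
    return 0;
}

On kernels that expose it, the TcpExtTCPSpuriousRtxHostQueues counter
(visible via nstat or /proc/net/netstat) is bumped each time a retransmit
is skipped for this reason, which can help confirm this path is being hit.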



* Re: TCP/IP connections sometimes stop retransmitting packets (in nested virtualization case)
  2021-10-18 18:05 ` TCP/IP connections sometimes stop retransmitting packets (in nested virtualization case) Eric Dumazet
@ 2021-10-18 20:49   ` Michael S. Tsirkin
  2021-10-18 22:12     ` Maxim Levitsky
  0 siblings, 1 reply; 3+ messages in thread
From: Michael S. Tsirkin @ 2021-10-18 20:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Maxim Levitsky, netdev, J. Bruce Fields, kvm, qemu-devel,
	Paolo Bonzini, David Gilbert

On Mon, Oct 18, 2021 at 11:05:23AM -0700, Eric Dumazet wrote:
> 
> 
> On 10/17/21 3:50 AM, Maxim Levitsky wrote:
> > Hi!
> >  
> > This is a follow-up to my mail about the NFS client deadlock I was trying to debug last week:
> > https://lore.kernel.org/all/e10b46b04fe4427fa50901dda71fb5f5a26af33e.camel@redhat.com/T/#u
> >  
> > I now strongly believe that this is not related to NFS, but rather to some issue in the networking stack, and maybe
> > to the somewhat non-standard .config I was using for the kernels, which has many advanced networking options disabled
> > (to cut down on compile time).
> > This is why I chose to start a new thread about it.
> >  
> > Regarding the custom .config file: in particular, I disabled CONFIG_NET_SCHED and CONFIG_TCP_CONG_ADVANCED.
> > Both the host and the fedora32 VM run the same kernel with those options disabled.
> > 
> > 
> > My setup is a VM (fedora32) which runs a Win10 Hyper-V VM nested inside it, which in turn runs a fedora32 VM
> > (but I was able to reproduce it with an ordinary VM, with Hyper-V disabled, running in the same fedora32 VM).
> >  
> > The host runs an NFS server, and the fedora32 VM runs an NFS client which is used to read/write a qcow2 file
> > which contains the disk of the nested Win10 VM. The L3 VM which the Windows VM optionally
> > runs is contained in the same qcow2 file.
> > 
> > 
> > I managed to capture (using wireshark) packets around the failure in both L0 and L1.
> > The trace shows a fair number of lost packets, a bit more than I would expect from communication running on the same host,
> > but they are retransmitted and don't cause any issues until the moment of failure.
> > 
> > 
> > The failure happens when one packet sent from the host to the guest
> > is not received by the guest (as evident from the L1 trace, and from the subsequent SACKs from the guest which exclude this packet),
> > and the host (on which the NFS server runs) never attempts to retransmit it.
> > 
> > 
> > The host keeps sending further TCP packets with replies to previous RPC calls it received from the fedora32 VM,
> > with increasing sequence numbers, as evident from both traces, and the fedora32 VM keeps SACK'ing those received packets,
> > patiently waiting for the retransmission.
> >  
> > After around 12 minutes (!), the host RSTs the connection.
> > 
> > It is worth mentioning that while all of this is happening, the fedora32 VM can hang if one attempts to access files
> > on the NFS share, because effectively all NFS communication is blocked at the TCP level.
> > 
> > I attached an extract from the two traces (in L0 and L1) around the failure, up to the RST packet.
> > 
> > In this trace, the second packet with TCP sequence number 1736557331 (the first one was empty, with no data) is not received by the guest
> > and is never retransmitted by the host.
> > 
> > Also worth noting: to save storage, I captured only 512 bytes of each packet, but wireshark
> > records how many bytes were in the actual packet.
> >  
> > Best regards,
> > 	Maxim Levitsky
> 
> TCP has special logic to not attempt a retransmit if it senses that the prior
> packet has not been consumed yet.
> 
> Usually, the consuming is done by NIC drivers at TX completion time,
> when the NIC signals that the packet has been sent to the wire.
> 
> It seems one skb is essentially leaked somewhere (never freed).

Thanks Eric!

Maxim, since the packets that leak are transmitted on the host,
the question then is: what kind of device do you use on the host
to talk to the guest? tap?


-- 
MST



* Re: TCP/IP connections sometimes stop retransmitting packets (in nested virtualization case)
  2021-10-18 20:49   ` Michael S. Tsirkin
@ 2021-10-18 22:12     ` Maxim Levitsky
  0 siblings, 0 replies; 3+ messages in thread
From: Maxim Levitsky @ 2021-10-18 22:12 UTC (permalink / raw)
  To: Michael S. Tsirkin, Eric Dumazet
  Cc: netdev, J. Bruce Fields, kvm, qemu-devel, Paolo Bonzini, David Gilbert

On Mon, 2021-10-18 at 16:49 -0400, Michael S. Tsirkin wrote:
> On Mon, Oct 18, 2021 at 11:05:23AM -0700, Eric Dumazet wrote:
> > 
> > On 10/17/21 3:50 AM, Maxim Levitsky wrote:
> > > Hi!
> > >  
> > > This is a follow-up to my mail about the NFS client deadlock I was trying to debug last week:
> > > https://lore.kernel.org/all/e10b46b04fe4427fa50901dda71fb5f5a26af33e.camel@redhat.com/T/#u
> > >  
> > > I now strongly believe that this is not related to NFS, but rather to some issue in the networking stack, and maybe
> > > to the somewhat non-standard .config I was using for the kernels, which has many advanced networking options disabled
> > > (to cut down on compile time).
> > > This is why I chose to start a new thread about it.
> > >  
> > > Regarding the custom .config file: in particular, I disabled CONFIG_NET_SCHED and CONFIG_TCP_CONG_ADVANCED.
> > > Both the host and the fedora32 VM run the same kernel with those options disabled.
> > > 
> > > 
> > > My setup is a VM (fedora32) which runs a Win10 Hyper-V VM nested inside it, which in turn runs a fedora32 VM
> > > (but I was able to reproduce it with an ordinary VM, with Hyper-V disabled, running in the same fedora32 VM).
> > >  
> > > The host runs an NFS server, and the fedora32 VM runs an NFS client which is used to read/write a qcow2 file
> > > which contains the disk of the nested Win10 VM. The L3 VM which the Windows VM optionally
> > > runs is contained in the same qcow2 file.
> > > 
> > > 
> > > I managed to capture (using wireshark) packets around the failure in both L0 and L1.
> > > The trace shows a fair number of lost packets, a bit more than I would expect from communication running on the same host,
> > > but they are retransmitted and don't cause any issues until the moment of failure.
> > > 
> > > 
> > > The failure happens when one packet sent from the host to the guest
> > > is not received by the guest (as evident from the L1 trace, and from the subsequent SACKs from the guest which exclude this packet),
> > > and the host (on which the NFS server runs) never attempts to retransmit it.
> > > 
> > > 
> > > The host keeps sending further TCP packets with replies to previous RPC calls it received from the fedora32 VM,
> > > with increasing sequence numbers, as evident from both traces, and the fedora32 VM keeps SACK'ing those received packets,
> > > patiently waiting for the retransmission.
> > >  
> > > After around 12 minutes (!), the host RSTs the connection.
> > > 
> > > It is worth mentioning that while all of this is happening, the fedora32 VM can hang if one attempts to access files
> > > on the NFS share, because effectively all NFS communication is blocked at the TCP level.
> > > 
> > > I attached an extract from the two traces (in L0 and L1) around the failure, up to the RST packet.
> > > 
> > > In this trace, the second packet with TCP sequence number 1736557331 (the first one was empty, with no data) is not received by the guest
> > > and is never retransmitted by the host.
> > > 
> > > Also worth noting: to save storage, I captured only 512 bytes of each packet, but wireshark
> > > records how many bytes were in the actual packet.
> > >  
> > > Best regards,
> > > 	Maxim Levitsky
> > 
> > TCP has special logic to not attempt a retransmit if it senses that the prior
> > packet has not been consumed yet.
> > 
> > Usually, the consuming is done by NIC drivers at TX completion time,
> > when the NIC signals that the packet has been sent to the wire.
> > 
> > It seems one skb is essentially leaked somewhere (never freed).
> 
> Thanks Eric!
> 
> Maxim, since the packets that leak are transmitted on the host,
> the question then is: what kind of device do you use on the host
> to talk to the guest? tap?
> 
> 
Yes, tap with a bridge, similar to how libvirt does 'bridge' networking for VMs.
I use my own set of scripts to run qemu directly.

Usually vhost is used in both L0 and L1, and it 'seems' to help reproduce the issue,
but I did reproduce this with vhost disabled on both L0 and L1.
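
For completeness, the backend amounts to something like the sketch below: a
process such as qemu attaching to a pre-created tap device that is enslaved
to the bridge (this is only an illustration, not my actual scripts; the
device name "tap0" is a placeholder):

/* Minimal sketch: attach to an existing tap device by name. */
#include <fcntl.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int tap_open(const char *name)
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);

    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    /* IFF_TAP: ethernet frames; IFF_NO_PI: no extra packet-info header */
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        close(fd);
        return -1;
    }
    return fd;  /* read()/write() on this fd move frames to/from the bridge */
}

int main(void)
{
    int fd = tap_open("tap0");  /* placeholder device name */

    if (fd < 0) {
        perror("tap_open");
        return 1;
    }
    /* With vhost enabled, qemu hands this fd to the vhost-net kernel
     * driver instead of reading/writing it from userspace itself. */
    close(fd);
    return 0;
}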

The capture was done on the bridge interface on L0, and on a virtual network card in L1.

It does seem that I am unable to make it fail again (maybe luck?) with CONFIG_NET_SCHED (and its sub-options)
and CONFIG_TCP_CONG_ADVANCED set back to their defaults (everything 'm').

Also, just to avoid going down the wrong path, note that I did once reproduce this with the e1000e virtual NIC,
so virtio is likely not to blame here.


Thanks,
Best regards,
	Maxim Levitsky


