From: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
To: Jason Wang <jasowang@redhat.com>, Wei Xu <wexu@redhat.com>
Cc: mst@redhat.com, netdev@vger.kernel.org, davem@davemloft.net
Subject: Re: Regression in throughput between kvm guests over virtual bridge
Date: Mon, 27 Nov 2017 21:44:07 -0500	[thread overview]
Message-ID: <edb28fe5-cedb-8e63-88b2-122d3dfe3014@linux.vnet.ibm.com> (raw)
In-Reply-To: <bcd4051d-5573-0841-a86b-8fccf03931c9@redhat.com>

On 11/27/2017 08:36 PM, Jason Wang wrote:
> 
> 
> On 2017-11-28 00:21, Wei Xu wrote:
>> On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
>>> On 11/14/2017 03:11 PM, Matthew Rosato wrote:
>>>> On 11/12/2017 01:34 PM, Wei Xu wrote:
>>>>> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
>>>>>>>> This case should be quite similar to pktgen: if you see an
>>>>>>>> improvement with pktgen, it is usually the same for UDP. Could you
>>>>>>>> please try disabling tso, gso, gro and ufo on all host tap devices
>>>>>>>> and guest virtio-net devices? Currently the most significant tests
>>>>>>>> would be these AFAICT:
>>>>>>>>
>>>>>>>> Host->VM     4.12    4.13
>>>>>>>>   TCP:
>>>>>>>>   UDP:
>>>>>>>> pktgen:
>>> So, I automated these scenarios for extended overnight runs and started
>>> hitting OOM conditions on a 40G system.  I did a bisect and it also
>>> points to c67df11f.  I can see a leak in at least all of the Host->VM
>>> testcases (TCP, UDP, pktgen), but the pktgen scenario leaks the fastest.
>>>
>>> I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
>>> intervals until a large percentage of host memory was consumed.  The
>>> numbers below were taken after the last pktgen run completed.  In
>>> summary, a very large number of active skbuff_head_cache entries can be
>>> seen - the sums of alloc/free calls match up, but the number of active
>>> skbuff_head_cache entries keeps growing each time the workload is run
>>> and never goes back down between runs.
>>>
>>> free -h:
>>>       total        used        free      shared  buff/cache   available
>>> Mem:   39G         31G        6.6G        472K        1.4G        6.8G
>>>
>>>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>
>>> 1001952 1000610  99%    0.75K  23856       42    763392K skbuff_head_cache
>>> 126192 126153  99%    0.36K   2868     44     45888K ksm_rmap_item
>>> 100485 100435  99%    0.41K   1305     77     41760K kernfs_node_cache
>>>   63294  39598  62%    0.48K    959     66     30688K dentry
>>>   31968  31719  99%    0.88K    888     36     28416K inode_cache
>>>
>>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>>      259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
>>> 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10
>>>
>>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>>    13492 <not-available> age=4295073614 pid=0 cpus=0
>>>   978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733 cpus=1-19
>>>        6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325 cpus=4,8,10,12,14
>>>        3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269 pid=0-11605 cpus=5,7,12
>>>        1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
>>>        2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
>>>        1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
>>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
>>>        3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273 pid=9915-11581 cpus=8,16,18
>>>        2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155 pid=11605-11699 cpus=2,9
>>>        1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835 pid=331 cpus=11
>>>     8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571 pid=11863 cpus=0
>>>
>>>
>>> By comparison, when running 4.13 with c67df11f reverted, here's the same
>>> output after the exact same test:
>>>
>>> free -h:
>>>        total        used        free      shared  buff/cache   available
>>> Mem:     39G        783M         37G        472K        637M         37G
>>>
>>> slabtop:
>>>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>     714    256  35%    0.75K     17     42      544K skbuff_head_cache
>>>
>>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>>      257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
>>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>>      255 <not-available> age=4295003081 pid=0 cpus=0
>>>        1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
>>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
>>>
>> Thanks a lot for the tests, and sorry for the late update - I was working
>> on the code path and didn't find anything helpful for you until today.
>>
>> I ran some tests, and initially the bottleneck appeared to be on the guest
>> kernel stack (napi) side.  After tracking the traffic footprints, it
>> turned out that the loss happens when the vring is full and cannot be
>> drained by the guest; the vhost driver then drops the SKB because there is
>> no headcount to fill it with.  This can be avoided by deferring
>> consumption of the SKB until a sufficient headcount has been obtained, as
>> in the patch below.
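
Side note for anyone reading along: the ordering problem Wei describes
boils down to dequeuing from the batched ring before we know the consumer
can take the packet, so any early-exit path leaks whatever was dequeued.
A tiny userspace illustration of just that pattern - ring_pop() and
guest_bufs_available() below are made-up stand-ins, not vhost code:

#include <stdlib.h>

/* ring_pop() models taking an skb out of the batched rx array;
 * guest_bufs_available() models get_rx_bufs() sometimes finding
 * the vring full (no headcount). */
static void *ring_pop(void)
{
        return malloc(64);
}

static int guest_bufs_available(int i)
{
        return i % 2;                   /* every other call: vring full */
}

/* Ordering as in 4.13: dequeue first, then discover there is no
 * headcount; the early return leaks what was dequeued. */
static void handle_rx_buggy(int i)
{
        void *skb = ring_pop();
        if (!guest_bufs_available(i))
                return;                 /* skb leaked here */
        free(skb);                      /* "delivered" */
}

/* Ordering as in the patch below: dequeue only once delivery is
 * guaranteed, so the early return cannot leak anything. */
static void handle_rx_fixed(int i)
{
        if (!guest_bufs_available(i))
                return;                 /* nothing dequeued, nothing leaked */
        free(ring_pop());
}

int main(void)
{
        for (int i = 0; i < 1000; i++)
                handle_rx_buggy(i);     /* leaks ~500 buffers */
        for (int i = 0; i < 1000; i++)
                handle_rx_fixed(i);     /* leaks none */
        return 0;
}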
>>
>> Could you please try it? It is based on 4.13, and I also applied Jason's
>> 'conditionally enable tx polling' patch:
>>      https://lkml.org/lkml/2016/6/1/39
> 
> This patch has already been merged.
> 
>>
>> I only tested the single-instance case from Host -> VM with uperf &
>> iperf3.  I like iperf3 a bit more since it reports retransmissions and
>> cwnd during the test. :)
>>
>> To maximize performance in the single-instance case, two vcpus are
>> needed: one handles the kernel napi work and the other serves the socket
>> syscalls (mostly reads) from uperf/iperf userspace.  So I gave the guest
>> two vcpus and pinned the iperf/uperf slave to the one not used by kernel
>> napi.  You may need to check which one to pin by watching CPU utilization
>> in a quick trial run before the long-duration test.
>>
>> With the patch there is a slight tcp performance improvement (host/guest
>> offload off) on x86, and 4.12 still wins 20-30% of the time, but the cwnd
>> and retransmit statistics are almost the same now; previously 'retrans'
>> was about 10x higher and cwnd about 6x smaller than on 4.12.
>>
>> Here is one typical sample of my tests.
>>                  4.12          4.13
>> offload on:   36.8Gbits     37.4Gbits
>> offload off:  7.68Gbits     7.84Gbits
>>
>> I also borrowed an s390x machine with 6 cpus and 4G memory from the
>> System z team.  It seems 4.12 is still a bit faster than 4.13 with
>> offloads off; could you please check whether this matches your test bed?
>>                  4.12          4.13
>> offload on:   37.3Gbits     38.3Gbits
>> offload off:  6.26Gbits     6.06Gbits
>>
>> For pktgen I got a 10% improvement (xdp1 drop on the guest), which is a
>> bit better than Jason's earlier number.
>>                  4.12          4.13
>>                3.33 Mpps     3.70 Mpps
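
As an aside, the xdp1-style sink on the guest side is just an XDP program
that drops every packet.  For anyone who wants to reproduce it without the
samples/bpf build, a minimal equivalent (assuming libbpf's bpf_helpers.h
for the SEC() macro; not the exact program used here) looks like:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Drop everything at the XDP hook of the guest's virtio-net device;
 * attach with e.g. "ip link set dev eth0 xdp obj xdp_drop.o sec xdp". */
SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx)
{
        return XDP_DROP;
}

char _license[] SEC("license") = "GPL";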
>>
>> Thanks again for all the tests you have done.
>>
>> Wei
>>
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
>>                  /* On error, stop handling until the next kick. */
>>                  if (unlikely(headcount < 0))
>>                          goto out;
>> -               if (nvq->rx_array)
>> -                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>>                  /* On overrun, truncate and discard */
>>                  if (unlikely(headcount > UIO_MAXIOV)) {
> 
> I think you need to do msg.msg_control = vhost_net_buf_consume() here too.
> 
>>                          iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
>> @@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
>>                           * they refilled. */
>>                          goto out;
>>                  }
>> +
>> +               if (nvq->rx_array)
>> +                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>> +
>>                  /* We don't need to be notified again. */
>>                  iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
>>                  fixup = msg.msg_iter;
>>
>>
> 
> Good catch, this fixes the memory leak too.
> 
> I suggest posting a formal patch for -net as soon as possible, since it
> is a valid fix even if it does not help performance.
>> Thanks
> 

+1 to posting this patch formally.  I also verified that it resolves the
memory leak I was experiencing.
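
For clarity, here is roughly how I read the end state of the rx loop in
drivers/vhost/net.c once Wei's reordering and Jason's point about the
overrun path are both folded in.  This is an untested sketch against 4.13,
not the formal patch; the elided branches are unchanged from the existing
code:

                /* On error, stop handling until the next kick. */
                if (unlikely(headcount < 0))
                        goto out;
                /* On overrun, truncate and discard - the batched skb still
                 * has to be consumed here, or it leaks. */
                if (unlikely(headcount > UIO_MAXIOV)) {
                        if (nvq->rx_array)
                                msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
                        iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
                        /* ... truncated recvmsg(), then continue ... */
                }
                /* OK, now we need to know about added descriptors. */
                if (!headcount) {
                        /* ... re-enable notification / goto out as before ... */
                }
                /* Dequeue from the batched array only once a headcount is
                 * guaranteed, so none of the early exits above can leak it. */
                if (nvq->rx_array)
                        msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
                /* We don't need to be notified again. */
                iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);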

In terms of performance numbers, here are quick #s using the original
environment where the regression was noted (4GB, 4vcpu guests, no CPU
binding, TCP VM<->VM):

4.12:	34.71Gb/s
4.13:	18.80Gb/s
4.13+:	38.26Gb/s

I'll keep running numbers, but that looks very promising.

Thread overview: 42+ messages
2017-09-12 17:56 Regression in throughput between kvm guests over virtual bridge Matthew Rosato
2017-09-13  1:16 ` Jason Wang
2017-09-13  8:13   ` Jason Wang
2017-09-13 16:59     ` Matthew Rosato
2017-09-14  4:21       ` Jason Wang
2017-09-15  3:36         ` Matthew Rosato
2017-09-15  8:55           ` Jason Wang
2017-09-15 19:19             ` Matthew Rosato
2017-09-18  3:13               ` Jason Wang
2017-09-18  4:14                 ` [PATCH] vhost_net: conditionally enable tx polling kbuild test robot
2017-09-18  7:36                 ` Regression in throughput between kvm guests over virtual bridge Jason Wang
2017-09-18 18:11                   ` Matthew Rosato
2017-09-20  6:27                     ` Jason Wang
2017-09-20 19:38                       ` Matthew Rosato
2017-09-22  4:03                         ` Jason Wang
2017-09-25 20:18                           ` Matthew Rosato
2017-10-05 20:07                             ` Matthew Rosato
2017-10-11  2:41                               ` Jason Wang
2017-10-12 18:31                               ` Wei Xu
2017-10-18 20:17                                 ` Matthew Rosato
2017-10-23  2:06                                   ` Jason Wang
2017-10-23  2:13                                     ` Michael S. Tsirkin
2017-10-25 20:21                                     ` Matthew Rosato
2017-10-26  9:44                                       ` Wei Xu
2017-10-26 17:53                                         ` Matthew Rosato
2017-10-31  7:07                                           ` Wei Xu
2017-10-31  7:00                                             ` Jason Wang
2017-11-03  4:30                                             ` Matthew Rosato
2017-11-04 23:35                                               ` Wei Xu
2017-11-08  1:02                                                 ` Matthew Rosato
2017-11-11 20:59                                                   ` Matthew Rosato
2017-11-12 18:34                                                     ` Wei Xu
2017-11-14 20:11                                                       ` Matthew Rosato
2017-11-20 19:25                                                         ` Matthew Rosato
2017-11-27 16:21                                                           ` Wei Xu
2017-11-28  1:36                                                             ` Jason Wang
2017-11-28  2:44                                                               ` Matthew Rosato [this message]
2017-11-28 18:00                                                                 ` Wei Xu
2017-11-28  3:51                                                               ` Wei Xu
2017-11-12 15:40                                                   ` Wei Xu
2017-10-23 13:57                                   ` Wei Xu
2017-10-25 20:31                                     ` Matthew Rosato
