From: Shunsuke Mie <mie@igel.co.jp>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 4/9] vringh: unify the APIs for all accessors
Date: Wed, 11 Jan 2023 13:10:15 +0900	[thread overview]
Message-ID: <28b421af-838d-e70a-ec95-2f14f21e3a90@igel.co.jp> (raw)
In-Reply-To: <20221228021354-mutt-send-email-mst@kernel.org>


On 2022/12/28 16:20, Michael S. Tsirkin wrote:
> On Wed, Dec 28, 2022 at 11:24:10AM +0900, Shunsuke Mie wrote:
>> 2022年12月27日(火) 23:37 Michael S. Tsirkin <mst@redhat.com>:
>>> On Tue, Dec 27, 2022 at 07:22:36PM +0900, Shunsuke Mie wrote:
>>>> 2022年12月27日(火) 16:49 Shunsuke Mie <mie@igel.co.jp>:
>>>>> 2022年12月27日(火) 16:04 Michael S. Tsirkin <mst@redhat.com>:
>>>>>> On Tue, Dec 27, 2022 at 11:25:26AM +0900, Shunsuke Mie wrote:
>>>>>>> Each of the vringh memory accessors for user, kern and iotlb has its own
>>>>>>> interface that calls common code, but some of that code is duplicated,
>>>>>>> which hurts extendability.
>>>>>>>
>>>>>>> Introduce a struct vringh_ops and provide a common API for all accessors.
>>>>>>> This makes it easy to extend the vringh code for new memory accessors and
>>>>>>> simplifies the caller code.
>>>>>>>
>>>>>>> Signed-off-by: Shunsuke Mie <mie@igel.co.jp>
>>>>>>> ---
>>>>>>>   drivers/vhost/vringh.c | 667 +++++++++++------------------------------
>>>>>>>   include/linux/vringh.h | 100 +++---
>>>>>>>   2 files changed, 225 insertions(+), 542 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
>>>>>>> index aa3cd27d2384..ebfd3644a1a3 100644
>>>>>>> --- a/drivers/vhost/vringh.c
>>>>>>> +++ b/drivers/vhost/vringh.c
>>>>>>> @@ -35,15 +35,12 @@ static __printf(1,2) __cold void vringh_bad(const char *fmt, ...)
>>>>>>>   }
>>>>>>>
>>>>>>>   /* Returns vring->num if empty, -ve on error. */
>>>>>>> -static inline int __vringh_get_head(const struct vringh *vrh,
>>>>>>> -                                 int (*getu16)(const struct vringh *vrh,
>>>>>>> -                                               u16 *val, const __virtio16 *p),
>>>>>>> -                                 u16 *last_avail_idx)
>>>>>>> +static inline int __vringh_get_head(const struct vringh *vrh, u16 *last_avail_idx)
>>>>>>>   {
>>>>>>>        u16 avail_idx, i, head;
>>>>>>>        int err;
>>>>>>>
>>>>>>> -     err = getu16(vrh, &avail_idx, &vrh->vring.avail->idx);
>>>>>>> +     err = vrh->ops.getu16(vrh, &avail_idx, &vrh->vring.avail->idx);
>>>>>>>        if (err) {
>>>>>>>                vringh_bad("Failed to access avail idx at %p",
>>>>>>>                           &vrh->vring.avail->idx);
>>>>>> I like that this patch removes more lines of code than it adds.
>>>>>>
>>>>>> However one of the design points of vringh abstractions is that they were
>>>>>> carefully written to be very low overhead.
>>>>>> This is why we are passing function pointers to inline functions -
>>>>>> compiler can optimize that out.
>>>>>>
>>>>>> I think that introducing indirect calls through an ops struct here is going
>>>>>> to break these assumptions and hurt performance.
>>>>>> Unless compiler can somehow figure it out and optimize?
>>>>>> I don't see how it's possible with ops pointer in memory
>>>>>> but maybe I'm wrong.
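
To make the concern concrete, here is a minimal sketch of the two calling
patterns being compared (the names and types are illustrative stand-ins,
not the actual vringh definitions):

#include <stdint.h>

struct vr;                                /* stand-in for struct vringh */
typedef int (*getu16_fn)(const struct vr *vr, uint16_t *val,
			 const uint16_t *p);

struct vr {
	struct {
		getu16_fn getu16;         /* accessor behind a pointer in memory */
	} ops;
};

/* Current pattern: the accessor is an argument to a static inline helper,
 * so when the caller passes a known static function the compiler can
 * inline it and no indirect call remains. */
static inline int get_idx_inline(const struct vr *vr, getu16_fn getu16,
				 uint16_t *val, const uint16_t *p)
{
	return getu16(vr, val, p);
}

/* Patched pattern: the accessor is loaded from vr->ops at run time, so
 * the compiler has to emit a real indirect call (a retpoline thunk once
 * retpolines are enabled). */
static inline int get_idx_ops(const struct vr *vr, uint16_t *val,
			      const uint16_t *p)
{
	return vr->ops.getu16(vr, val, p);
}

static int getu16_plain(const struct vr *vr, uint16_t *val,
			const uint16_t *p)
{
	(void)vr;
	*val = *p;                        /* pretend this is the real accessor */
	return 0;
}

int example(const struct vr *vr, uint16_t *val, const uint16_t *p)
{
	int a = get_idx_inline(vr, getu16_plain, val, p); /* usually inlined away */
	int b = get_idx_ops(vr, val, p);                  /* stays an indirect call */
	return a | b;
}

With the first form the compiler can typically resolve and inline the
accessor at compile time; with the second it cannot, which is where the
indirect-call (and, with retpolines, thunk) cost would come from.
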
>>>>> I think your concern is correct. I have to understand the compiler
>>>>> optimization and redesign this approach if it is needed.
>>>>>> Was any effort taken to test effect of these patches on performance?
>>>>> I just tested vringh_test and already saw a small performance reduction.
>>>>> I have to investigate that, as you said.
>>>> I attempted to test with perf. I found that the performance of the patched
>>>> code is almost the same as the upstream one. However, I have to investigate
>>>> why this patch leads to this result, and the profiling should also be run
>>>> on more powerful machines.
>>>>
>>>> environment:
>>>> $ grep 'model name' /proc/cpuinfo
>>>> model name      : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
>>>> model name      : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
>>>> model name      : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
>>>> model name      : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
>>>>
>>>> results:
>>>> * for patched code
>>>>   Performance counter stats for 'nice -n -20 ./vringh_test_patched --parallel --eventidx --fast-vringh --indirect --virtio-1' (20 runs):
>>>>
>>>>            3,028.05 msec task-clock                #    0.995 CPUs utilized            ( +-  0.12% )
>>>>              78,150      context-switches          #   25.691 K/sec                    ( +-  0.00% )
>>>>                   5      cpu-migrations            #    1.644 /sec                     ( +-  3.33% )
>>>>                 190      page-faults               #   62.461 /sec                     ( +-  0.41% )
>>>>       6,919,025,222      cycles                    #    2.275 GHz                      ( +-  0.13% )
>>>>       8,990,220,160      instructions              #    1.29  insn per cycle           ( +-  0.04% )
>>>>       1,788,326,786      branches                  #  587.899 M/sec                    ( +-  0.05% )
>>>>           4,557,398      branch-misses             #    0.25% of all branches          ( +-  0.43% )
>>>>
>>>>             3.04359 +- 0.00378 seconds time elapsed  ( +-  0.12% )
>>>>
>>>> * for upstream code
>>>>   Performance counter stats for 'nice -n -20 ./vringh_test_base --parallel --eventidx --fast-vringh --indirect --virtio-1' (10 runs):
>>>>
>>>>            3,058.41 msec task-clock                #    0.999 CPUs utilized            ( +-  0.14% )
>>>>              78,149      context-switches          #   25.545 K/sec                    ( +-  0.00% )
>>>>                   5      cpu-migrations            #    1.634 /sec                     ( +-  2.67% )
>>>>                 194      page-faults               #   63.414 /sec                     ( +-  0.43% )
>>>>       6,988,713,963      cycles                    #    2.284 GHz                      ( +-  0.14% )
>>>>       8,512,533,269      instructions              #    1.22  insn per cycle           ( +-  0.04% )
>>>>       1,638,375,371      branches                  #  535.549 M/sec                    ( +-  0.05% )
>>>>           4,428,866      branch-misses             #    0.27% of all branches          ( +- 22.57% )
>>>>
>>>>             3.06085 +- 0.00420 seconds time elapsed  ( +-  0.14% )
>>>
>>> How you compiled it also matters. ATM we don't enable retpolines
>>> and it did not matter since we didn't have indirect calls,
>>> but we should. Didn't yet investigate how to do that for virtio tools.
>> I think the retpolines certainly affect performance. Thank you for pointing
>> it out. I'd like to start investigating how to apply retpolines to the
>> virtio tools.
>>>>> Thank you for your comments.
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>> Best,
>>>>> Shunsuke.
> This isn't all that trivial if we want this at runtime.
> But compile time is kind of easy.
> See Documentation/admin-guide/hw-vuln/spectre.rst

Thank you for pointing me to it.

I followed the document and added the following options to CFLAGS in the
tools Makefile:

---

diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
index 1b25cc7c64bb..7b7139d97d74 100644
--- a/tools/virtio/Makefile
+++ b/tools/virtio/Makefile
@@ -4,7 +4,7 @@ test: virtio_test vringh_test
  virtio_test: virtio_ring.o virtio_test.o
  vringh_test: vringh_test.o vringh.o virtio_ring.o

-CFLAGS += -g -O2 -Werror -Wno-maybe-uninitialized -Wall -I. -I../include/ -I ../../usr/include/ -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -include ../../include/linux/kconfig.h
+CFLAGS += -g -O2 -Werror -Wno-maybe-uninitialized -Wall -I. -I../include/ -I ../../usr/include/ -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -include ../../include/linux/kconfig.h -mfunction-return=thunk -fcf-protection=none -mindirect-branch-register
  CFLAGS += -pthread
  LDFLAGS += -pthread
  vpath %.c ../../drivers/virtio ../../drivers/vhost
---
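
As a quick sanity check that the thunks really got emitted with these flags
(just a sketch, not part of the measured runs; the exact thunk symbol names
depend on the GCC version and on which thunk variant the flags select):

$ objdump -d ./vringh_test_retp_origin | grep -c thunk
$ nm ./vringh_test_retp_origin | grep -i thunk

A non-zero count and matching local symbols indicate the compiler generated
the return/indirect-branch thunks.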

The evaluation results are as follows:

- base with retpoline

$ sudo perf stat --repeat 20 -- nice -n -20 ./vringh_test_retp_origin --parallel --eventidx --fast-vringh
Using CPUS 0 and 3
Guest: notified 0, pinged 98040
Host: notified 98040, pinged 0
...

  Performance counter stats for 'nice -n -20 ./vringh_test_retp_origin --parallel --eventidx --fast-vringh' (20 runs):

           6,228.33 msec task-clock                #    1.004 CPUs utilized            ( +-  0.05% )
            196,110      context-switches          #   31.616 K/sec                    ( +-  0.00% )
                  6      cpu-migrations            #    0.967 /sec                     ( +-  2.39% )
                205      page-faults               #   33.049 /sec                     ( +-  0.46% )
     14,218,527,987      cycles                    #    2.292 GHz                      ( +-  0.05% )
     10,342,897,254      instructions              #    0.73  insn per cycle           ( +-  0.02% )
      2,310,572,989      branches                  #  372.500 M/sec                    ( +-  0.03% )
        178,273,068      branch-misses             #    7.72% of all branches          ( +-  0.04% )

            6.20406 +- 0.00308 seconds time elapsed  ( +-  0.05% )

- patched (unified APIs) with retpoline

$ sudo perf stat --repeat 20 -- nice -n -20 ./vringh_test_retp_patched --parallel --eventidx --fast-vringh
Using CPUS 0 and 3
Guest: notified 0, pinged 98040
Host: notified 98040, pinged 0
...

  Performance counter stats for 'nice -n -20 ./vringh_test_retp_patched --parallel --eventidx --fast-vringh' (20 runs):

           6,103.94 msec task-clock                #    1.001 CPUs utilized            ( +-  0.03% )
            196,125      context-switches          #   32.165 K/sec                    ( +-  0.00% )
                  7      cpu-migrations            #    1.148 /sec                     ( +-  1.56% )
                196      page-faults               #   32.144 /sec                     ( +-  0.41% )
     13,933,055,778      cycles                    #    2.285 GHz                      ( +-  0.03% )
     10,309,004,718      instructions              #    0.74  insn per cycle           ( +-  0.03% )
      2,368,447,519      branches                  #  388.425 M/sec                    ( +-  0.04% )
        211,364,886      branch-misses             #    8.94% of all branches          ( +-  0.05% )

            6.09888 +- 0.00155 seconds time elapsed  ( +-  0.03% )

As a result, the patched code shows more branch-misses, but its elapsed time
is slightly shorter (about 1.7% faster) than the base code. The number of
page-faults also differs a little, and I suspect the page-fault penalty is
what drives the overall result.

I think the data-access pattern is the same in both cases, but the
instruction-access pattern is different. In fact the code size (.text
segment) is a little smaller: 0x6a65 bytes for the base binary versus
0x63f5 bytes for the patched one (about 1.6 KiB less).

$ readelf -a ./vringh_test_retp_origin |grep .text -1
        0000000000000008  0000000000000008  AX       0     0     8
   [14] .text             PROGBITS         0000000000001230 00001230
        0000000000006a65  0000000000000000  AX       0     0     16
--
    02     .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
    03     .init .plt .plt.got .text .fini
    04     .rodata .eh_frame_hdr .eh_frame


$ readelf -a ./vringh_test_retp_patched |grep .text -1
        0000000000000008  0000000000000008  AX       0     0     8
   [14] .text             PROGBITS         0000000000001230 00001230
        00000000000063f5  0000000000000000  AX       0     0     16
--
    02     .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
    03     .init .plt .plt.got .text .fini
    04     .rodata .eh_frame_hdr .eh_frame
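
As a simpler cross-check of the two sizes (a sketch only; size(1) in its
default Berkeley format lumps .text together with the other read-only and
executable sections, so its "text" column is only roughly comparable to the
readelf numbers above):

$ size ./vringh_test_retp_origin ./vringh_test_retp_patched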

I'll keep investigating this. I would appreciate any comments you may have.

Best

>
