* TB Cache size grows out of control with qemu 5.0
@ 2020-07-15 14:29 Christian Ehrhardt
  2020-07-15 15:58 ` BALATON Zoltan
  0 siblings, 1 reply; 7+ messages in thread
From: Christian Ehrhardt @ 2020-07-15 14:29 UTC (permalink / raw)
  To: qemu-devel

Hi,
Since qemu 5.0 I have found that Ubuntu test environments crash often.
After a while I found that all affected tests run qemu in TCG mode and
get into OOM conditions that kill the qemu process.

Steps to reproduce:
Run TCG on a guest image until boot settles:
  $ wget
http://cloud-images.ubuntu.com/daily/server/groovy/20200714/groovy-server-cloudimg-amd64.img
  $ qemu-system-x86_64 -name guest=groovy-testguest,debug-threads=on
-machine pc-q35-focal,accel=tcg,usb=off,dump-guest-core=off -cpu qemu64 -m
512 -overcommit mem-lock=off -smp 1,sockets=1,cores=1,threads=1 -hda
groovy-server-cloudimg-amd64.img -nographic -serial file:/tmp/serial.debug
I usually wait until I see no more faults in pidstat (which indicates that
bootup is complete). At that point the worker threads also vanish, or at
least their number drops significantly.
Then I checked RSS sizes:
  $ pidstat -p $(pgrep -f 'name guest=groovy-testguest') -T ALL -rt 5

To know the expected deviation I ran it a few times with the old and the
new version:

        qemu 4.2 (KB)      qemu 5.0 (KB)
    VSZ     RSS        VSZ     RSS
1735828  642172    2322668 1641532
1424596  602368    2374068 1628788
1570060  611372    2789648 1676748
1556696  611240    2981112 1658196
1388844  649696    2443716 1636896
1597788  644584    2989336 1635516

That is roughly a +160% increase in RSS.
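(Sanity check on the averages above: ~626,900 KB vs ~1,646,300 KB, a factor
of about 2.6, which matches the ~+160% figure.)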

I was wondering whether that might be due to the new toolchain or to other
features rather than TCG (even though all non-TCG tests showed no issue).
I ran the same with -enable-kvm, which shows no difference worth reporting:

accel=kvm old qemu (KB):  accel=kvm new qemu (KB):
    VSZ     RSS           VSZ     RSS
1844232  489224       1195880  447696
1068784  448324       1330036  484464
1583020  448708       1380408  468588
1247244  493980       1244148  493188
1702912  483444       1247672  454444
1287980  448480       1983548  501184

So it seems to come down to "4.2 TCG vs 5.0 TCG".
Therefore I spun up a 4.2 and a 5.0 qemu with TCG, both showing this ~+160%
increase in memory consumption.
Using smem I then checked where the consumption went per mapping:

# smem --reverse --mappings --abbreviate --processfilter=qemu-system-x86_64
| head -n 10
                           qemu 4.2          qemu 5.0
Map                 AVGPSS      PSS   AVGPSS      PSS
<anonymous>         289.5M   579.0M   811.5M     1.6G
qemu-system-x86_64    9.1M     9.1M     9.2M     9.2M
[heap]                2.8M     5.6M     3.4M     6.8M
/usr/bin/python3.8    1.8M     1.8M     1.8M     1.8M
/libepoxy.so.0.     448.0K   448.0K   448.0K   448.0K
/libcrypto.so.1     296.0K   296.0K   275.0K   275.0K
/libgnutls.so.3     234.0K   234.0K   230.0K   230.0K
/libasound.so.2     208.0K   208.0K   208.0K   208.0K
/libssl.so.1.1      180.0K   180.0K    92.0K   184.0K

So all the increase is in anon memory of qemu.

Since it is TCG I ran `info jit` in the Monitor
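(For anyone reproducing this: one way to reach the monitor with such a
command line - not necessarily what was used here - is to expose it
explicitly, e.g. by adding '-monitor telnet:127.0.0.1:4444,server,nowait'
and connecting to that port with telnet, then typing 'info jit'.)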

qemu 4.2:
(qemu) info jit
Translation buffer state:
gen code size       99.781.715/134.212.563
TB count            183622
TB avg target size  18 max=1992 bytes
TB avg host size    303 bytes (expansion ratio: 16.4)
cross page TB count 797 (0%)
direct jump count   127941 (69%) (2 jumps=91451 49%)
TB hash buckets     98697/131072 (75.30% head buckets used)
TB hash occupancy   34.04% avg chain occ. Histogram: [0,10)%|▆ █ ▅▁▃▁▁|[90,100]%
TB hash avg chain   1.020 buckets. Histogram: 1|█▁▁|3

Statistics:
TB flush count      14
TB invalidate count 92226
TLB full flushes    1
TLB partial flushes 175405
TLB elided flushes  233747
[TCG profiler not compiled]

qemu 5.0:
(qemu) info jit
Translation buffer state:
gen code size       259.896.403/1.073.736.659
TB count            456365
TB avg target size  20 max=1992 bytes
TB avg host size    328 bytes (expansion ratio: 16.1)
cross page TB count 2020 (0%)
direct jump count   309815 (67%) (2 jumps=227122 49%)
TB hash buckets     216220/262144 (82.48% head buckets used)
TB hash occupancy   41.36% avg chain occ. Histogram: [0,10)%|▅ █ ▇▁▄▁▂|[90,100]%
TB hash avg chain   1.039 buckets. Histogram: 1|█▁▁|3

Statistics:
TB flush count      1
TB invalidate count 463653
TLB full flushes    0
TLB partial flushes 178464
TLB elided flushes  242382
[TCG profiler not compiled]

Well I see the numbers increase, but this isn't my home turf anymore.

The one related tunable I know of is -tb-size. I ran both versions with
  -tb-size 150
and the result gave me two similarly behaving qemu processes.
RSS (KB)
qemu 4.2: 628072 635528
qemu 5.0: 655628 634952

Seems to be "good again" with that tunable set.
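(For reference: on newer QEMU the non-deprecated spelling of the same
tunable is the accelerator property, e.g. '-accel tcg,tb-size=150', with
the value in MiB; the runs above used the legacy -tb-size option.)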

It seems the default sizing of the TB cache, its self-reduction, or
something like that has changed. On a system with ~1.5G of RAM for example
(which matches our testbeds) I'd expect it to back down a bit rather than
getting OOM-killed after consuming almost 100% of the memory.

My next step is to build qemu from source without the Ubuntu downstream
delta.
That should help to track this down further and also provide some results
for the soon-to-be-released 5.1. That will probably take until tomorrow;
I'll report here again then.

I searched the mailing list and the web for this behavior, but either I
used the wrong keywords or it wasn't reported/discussed yet.
Nor does [1] list anything that sounds related.
But if this already rings a bell for someone, please let me know.
Thanks in advance!

[1]: https://wiki.qemu.org/ChangeLog/5.0#TCG

-- 
Christian Ehrhardt
Staff Engineer, Ubuntu Server
Canonical Ltd

* Re: TB Cache size grows out of control with qemu 5.0
  2020-07-15 14:29 TB Cache size grows out of control with qemu 5.0 Christian Ehrhardt
@ 2020-07-15 15:58 ` BALATON Zoltan
  2020-07-16 13:25   ` Christian Ehrhardt
  0 siblings, 1 reply; 7+ messages in thread
From: BALATON Zoltan @ 2020-07-15 15:58 UTC (permalink / raw)
  To: Christian Ehrhardt; +Cc: qemu-devel

See commit 47a2def4533a2807e48954abd50b32ecb1aaf29a and the next two 
following it.


* Re: TB Cache size grows out of control with qemu 5.0
  2020-07-15 15:58 ` BALATON Zoltan
@ 2020-07-16 13:25   ` Christian Ehrhardt
  2020-07-16 16:27     ` Alex Bennée
  0 siblings, 1 reply; 7+ messages in thread
From: Christian Ehrhardt @ 2020-07-16 13:25 UTC (permalink / raw)
  To: BALATON Zoltan, Alex Bennée, Richard Henderson
  Cc: Paolo Bonzini, qemu-devel

On Wed, Jul 15, 2020 at 5:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:

> See commit 47a2def4533a2807e48954abd50b32ecb1aaf29a and the next two
> following it.
>

Thank you Zoltan for pointing out this commit; I agree that it seems to be
the trigger for the issues I'm seeing. Unfortunately the common CI host size
is 1-2G - for example 1.5G on the Ubuntu autopkgtest workers.
Those of them that run guests do so with 0.5-1G of guest RAM in TCG mode
(as they often can't rely on having KVM available).

The 1G TB buffer + 0.5G actual guest size + lack of dynamic downsizing
on memory pressure (which never existed) make these systems OOM-kill
the qemu process.

The patches indicated that the TB flushes on a full guest boot are a
good indicator of the TB size efficiency. From my old checks I had:

- Qemu 4.2 512M guest with 32M default overwritten by ram-size/4
TB flush count      14, 14, 16
- Qemu 5.0 512M guest with 1G default
TB flush count      1, 1, 1

I agree that ram/4 seems odd; especially on huge guests that is potentially
a lot of wasted memory. And while most environments have a bit of breathing
room, 1G is too big on small host systems, and the common CI systems fall
into this category. So I tuned it down to 256M for a test.

- Qemu 4.2 512M guest with tb-size 256M
TB flush count      5, 5, 5
- Qemu 5.0 512M guest with tb-size 256M
TB flush count      5, 5, 5
- Qemu 5.0 512M guest with 256M default in code
TB flush count      5, 5, 5

So performance-wise the results are as much in between as you'd expect from
a TB size in between. And the memory consumption, which (for me) is the
actual issue to fix right now, would be back in line again as expected.

So on one hand I'm suggesting something like:
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -944,7 +944,7 @@ static void page_lock_pair(PageDesc **re
  * Users running large scale system emulation may want to tweak their
  * runtime setup via the tb-size control on the command line.
  */
-#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (1 * GiB)
+#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (256 * MiB)
 #endif
 #endif

OTOH I understand someone else might want to keep the speedier 1G,
especially for large guests. If someone used to run a 4G guest in TCG, the
TB size was 1G all along.
How about picking the smaller of 1G and ram-size/4 as the default?

This might then look like:
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -956,7 +956,12 @@ static inline size_t size_code_gen_buffe
 {
     /* Size the buffer.  */
     if (tb_size == 0) {
-        tb_size = DEFAULT_CODE_GEN_BUFFER_SIZE;
+        unsigned long max_default = (unsigned long)(ram_size / 4);
+        if (max_default < DEFAULT_CODE_GEN_BUFFER_SIZE) {
+            tb_size = max_default;
+        } else {
+           tb_size = DEFAULT_CODE_GEN_BUFFER_SIZE;
+        }
     }
     if (tb_size < MIN_CODE_GEN_BUFFER_SIZE) {
         tb_size = MIN_CODE_GEN_BUFFER_SIZE;

This is a bit more tricky than it seems, as ram_size is no longer
present in that context, but it is enough to discuss the idea.
Wouldn't that serve all cases - small and large - better than a pure
static default of 1G or an unconditional ram/4?

P.S. I added Alex as the author of the offending patch, and Richard/Paolo
for being listed in the MAINTAINERS file for TCG.

-- 
Christian Ehrhardt
Staff Engineer, Ubuntu Server
Canonical Ltd

* Re: TB Cache size grows out of control with qemu 5.0
  2020-07-16 13:25   ` Christian Ehrhardt
@ 2020-07-16 16:27     ` Alex Bennée
  2020-07-16 22:35       ` BALATON Zoltan
  2020-07-17 13:40       ` Christian Ehrhardt
  0 siblings, 2 replies; 7+ messages in thread
From: Alex Bennée @ 2020-07-16 16:27 UTC (permalink / raw)
  To: Christian Ehrhardt; +Cc: Paolo Bonzini, Richard Henderson, qemu-devel


Christian Ehrhardt <christian.ehrhardt@canonical.com> writes:

> On Wed, Jul 15, 2020 at 5:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
>
>> See commit 47a2def4533a2807e48954abd50b32ecb1aaf29a and the next two
>> following it.
>>
>
> Thank you Zoltan for pointing out this commit, I agree that this seems to be
> the trigger for the issues I'm seeing. Unfortunately the common CI host size
> is 1-2G. For example on Ubuntu Autopkgtests 1.5G.
> Those of them running guests do so in 0.5-1G size in TCG mode
> (as they often can't rely on having KVM available).
>
> The 1G TB buffer + 0.5G actual guest size + lack of dynamic downsizing
> on memory pressure (never existed) makes these systems go OOM-Killing
> the qemu process.

Ooops. I admit the assumption was that most people running system
emulation would be doing it on beefier machines.

> The patches indicated that the TB flushes on a full guest boot are a
> good indicator of the TB size efficiency. From my old checks I had:
>
> - Qemu 4.2 512M guest with 32M default overwritten by ram-size/4
> TB flush count      14, 14, 16
> - Qemu 5.0 512M guest with 1G default
> TB flush count      1, 1, 1
>
> I agree that ram/4 seems odd, especially on huge guests that is a lot
> potentially wasted. And most environments have a bit of breathing
> room 1G is too big in small host systems and the common CI system falls
> into this category. So I tuned it down to 256M for a test.
>
> - Qemu 4.2 512M guest with tb-size 256M
> TB flush count      5, 5, 5
> - Qemu 5.0 512M guest with tb-size 256M
> TB flush count      5, 5, 5
> - Qemu 5.0 512M guest with 256M default in code
> TB flush count      5, 5, 5
>
> So performance wise the results are as much in-between as you'd think from a
> TB size in between. And the memory consumption which (for me) is the actual
> current issue to fix would be back in line again as expected.

So I'm glad you have the workaround. 

>
> So on one hand I'm suggesting something like:
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -944,7 +944,7 @@ static void page_lock_pair(PageDesc **re
>   * Users running large scale system emulation may want to tweak their
>   * runtime setup via the tb-size control on the command line.
>   */
> -#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (1 * GiB)
> +#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (256 * MiB)

The problem we have is that any number we pick here is arbitrary. And while
it did regress your use case, changing it again just pushes a performance
regression onto someone else. Most (*) 64-bit desktop PCs have 16GB
of RAM, and almost all have more than 8GB. And there is a workaround.

* rough number from Steam's HW survey.

>  #endif
>  #endif
>
> OTOH I understand someone else might want to get the more speedy 1G
> especially for large guests. If someone used to run a 4G guest in TCG the
> TB Size was 1G all along.
> How about picking the smaller of (1G || ram-size/4) as default?
>
> This might then look like:
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -956,7 +956,12 @@ static inline size_t size_code_gen_buffe
>  {
>      /* Size the buffer.  */
>      if (tb_size == 0) {
> -        tb_size = DEFAULT_CODE_GEN_BUFFER_SIZE;
> +        unsigned long max_default = (unsigned long)(ram_size / 4);
> +        if (max_default < DEFAULT_CODE_GEN_BUFFER_SIZE) {
> +            tb_size = max_default;
> +        } else {
> +           tb_size = DEFAULT_CODE_GEN_BUFFER_SIZE;
> +        }
>      }
>      if (tb_size < MIN_CODE_GEN_BUFFER_SIZE) {
>          tb_size = MIN_CODE_GEN_BUFFER_SIZE;
>
> This is a bit more tricky than it seems as ram_sizes is no more
> present in that context but it is enough to discuss it.
> That should serve all cases - small and large - better as a pure
> static default of 1G or always ram/4?

I'm definitely against re-introducing ram_size into the mix. The
original commit (a1b18df9a4) that broke this introduced an ordering
dependency which we don't want to bring back.

I'd be more amenable to something that took into account host memory and
limited the default if it was smaller than a threshold. Is there a way
to probe that that doesn't involve slurping /proc/meminfo?

>
> P.S. I added Alex being the Author of the offending patch and Richard/Paolo
> for being listed in the Maintainers file for TCG.


-- 
Alex Bennée


* Re: TB Cache size grows out of control with qemu 5.0
  2020-07-16 16:27     ` Alex Bennée
@ 2020-07-16 22:35       ` BALATON Zoltan
  2020-07-17 13:40       ` Christian Ehrhardt
  1 sibling, 0 replies; 7+ messages in thread
From: BALATON Zoltan @ 2020-07-16 22:35 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Paolo Bonzini, qemu-devel, Christian Ehrhardt, Richard Henderson

On Thu, 16 Jul 2020, Alex Bennée wrote:
> Christian Ehrhardt <christian.ehrhardt@canonical.com> writes:
>> On Wed, Jul 15, 2020 at 5:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
>>> See commit 47a2def4533a2807e48954abd50b32ecb1aaf29a and the next two
>>> following it.
>>
>> Thank you Zoltan for pointing out this commit, I agree that this seems to be
>> the trigger for the issues I'm seeing. Unfortunately the common CI host size
>> is 1-2G. For example on Ubuntu Autopkgtests 1.5G.
>> Those of them running guests do so in 0.5-1G size in TCG mode
>> (as they often can't rely on having KVM available).
>>
>> The 1G TB buffer + 0.5G actual guest size + lack of dynamic downsizing
>> on memory pressure (never existed) makes these systems go OOM-Killing
>> the qemu process.
>
> Ooops. I admit the assumption was that most people running system
> emulation would be doing it on beefier machines.
>
>> The patches indicated that the TB flushes on a full guest boot are a
>> good indicator of the TB size efficiency. From my old checks I had:
>>
>> - Qemu 4.2 512M guest with 32M default overwritten by ram-size/4
>> TB flush count      14, 14, 16
>> - Qemu 5.0 512M guest with 1G default
>> TB flush count      1, 1, 1
>>
>> I agree that ram/4 seems odd, especially on huge guests that is a lot
>> potentially wasted. And most environments have a bit of breathing
>> room 1G is too big in small host systems and the common CI system falls
>> into this category. So I tuned it down to 256M for a test.
>>
>> - Qemu 4.2 512M guest with tb-size 256M
>> TB flush count      5, 5, 5
>> - Qemu 5.0 512M guest with tb-size 256M
>> TB flush count      5, 5, 5
>> - Qemu 5.0 512M guest with 256M default in code
>> TB flush count      5, 5, 5
>>
>> So performance wise the results are as much in-between as you'd think from a
>> TB size in between. And the memory consumption which (for me) is the actual
>> current issue to fix would be back in line again as expected.
>
> So I'm glad you have the workaround.
>
>>
>> So on one hand I'm suggesting something like:
>> --- a/accel/tcg/translate-all.c
>> +++ b/accel/tcg/translate-all.c
>> @@ -944,7 +944,7 @@ static void page_lock_pair(PageDesc **re
>>   * Users running large scale system emulation may want to tweak their
>>   * runtime setup via the tb-size control on the command line.
>>   */
>> -#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (1 * GiB)
>> +#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (256 * MiB)
>
> The problem we have is any number we pick here is arbitrary. And while
> it did regress your use-case changing it again just pushes a performance
> regression onto someone else. The most (*) 64 bit desktop PCs have 16Gb
> of RAM, almost all have more than 8gb. And there is a workaround.
>
> * random number from Steams HW survey.
>
>>  #endif
>>  #endif
>>
>> OTOH I understand someone else might want to get the more speedy 1G
>> especially for large guests. If someone used to run a 4G guest in TCG the
>> TB Size was 1G all along.
>> How about picking the smaller of (1G || ram-size/4) as default?
>>
>> This might then look like:
>> --- a/accel/tcg/translate-all.c
>> +++ b/accel/tcg/translate-all.c
>> @@ -956,7 +956,12 @@ static inline size_t size_code_gen_buffe
>>  {
>>      /* Size the buffer.  */
>>      if (tb_size == 0) {
>> -        tb_size = DEFAULT_CODE_GEN_BUFFER_SIZE;
>> +        unsigned long max_default = (unsigned long)(ram_size / 4);
>> +        if (max_default < DEFAULT_CODE_GEN_BUFFER_SIZE) {
>> +            tb_size = max_default;
>> +        } else {
>> +           tb_size = DEFAULT_CODE_GEN_BUFFER_SIZE;
>> +        }
>>      }
>>      if (tb_size < MIN_CODE_GEN_BUFFER_SIZE) {
>>          tb_size = MIN_CODE_GEN_BUFFER_SIZE;
>>
>> This is a bit more tricky than it seems as ram_sizes is no more
>> present in that context but it is enough to discuss it.
>> That should serve all cases - small and large - better as a pure
>> static default of 1G or always ram/4?
>
> I'm definitely against re-introducing ram_size into the mix. The
> original commit (a1b18df9a4) that broke this introduced an ordering
> dependency which we don't want to bring back.
>
> I'd be more amenable to something that took into account host memory and
> limited the default if it was smaller than a threshold. Is there a way
> to probe that that doesn't involve slurping /proc/meminfo?

I agree that this should depend on host memory size, not guest ram_size,
but it might be tricky to get that value because different host OSes would
need different ways. Maybe a new qemu_host_mem_size portability function
will be needed that implements this for the different host OSes. POSIX may
or may not have sysconf(_SC_PHYS_PAGES) and sysconf(_SC_AVPHYS_PAGES), and
Linux has sysinfo(), but I don't know how reliable these are.

Regards,
BALATON Zoltan

* Re: TB Cache size grows out of control with qemu 5.0
  2020-07-16 16:27     ` Alex Bennée
  2020-07-16 22:35       ` BALATON Zoltan
@ 2020-07-17 13:40       ` Christian Ehrhardt
  2020-07-17 14:27         ` Alex Bennée
  1 sibling, 1 reply; 7+ messages in thread
From: Christian Ehrhardt @ 2020-07-17 13:40 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Paolo Bonzini, Richard Henderson, qemu-devel

On Thu, Jul 16, 2020 at 6:27 PM Alex Bennée <alex.bennee@linaro.org> wrote:

>
> Christian Ehrhardt <christian.ehrhardt@canonical.com> writes:
>
> > On Wed, Jul 15, 2020 at 5:58 PM BALATON Zoltan <balaton@eik.bme.hu>
> wrote:
> >
> >> See commit 47a2def4533a2807e48954abd50b32ecb1aaf29a and the next two
> >> following it.
> >>
> >
> > Thank you Zoltan for pointing out this commit, I agree that this seems
> to be
> > the trigger for the issues I'm seeing. Unfortunately the common CI host
> size
> > is 1-2G. For example on Ubuntu Autopkgtests 1.5G.
> > Those of them running guests do so in 0.5-1G size in TCG mode
> > (as they often can't rely on having KVM available).
> >
> > The 1G TB buffer + 0.5G actual guest size + lack of dynamic downsizing
> > on memory pressure (never existed) makes these systems go OOM-Killing
> > the qemu process.
>
> Ooops. I admit the assumption was that most people running system
> emulation would be doing it on beefier machines.
>
> > The patches indicated that the TB flushes on a full guest boot are a
> > good indicator of the TB size efficiency. From my old checks I had:
> >
> > - Qemu 4.2 512M guest with 32M default overwritten by ram-size/4
> > TB flush count      14, 14, 16
> > - Qemu 5.0 512M guest with 1G default
> > TB flush count      1, 1, 1
> >
> > I agree that ram/4 seems odd, especially on huge guests that is a lot
> > potentially wasted. And most environments have a bit of breathing
> > room 1G is too big in small host systems and the common CI system falls
> > into this category. So I tuned it down to 256M for a test.
> >
> > - Qemu 4.2 512M guest with tb-size 256M
> > TB flush count      5, 5, 5
> > - Qemu 5.0 512M guest with tb-size 256M
> > TB flush count      5, 5, 5
> > - Qemu 5.0 512M guest with 256M default in code
> > TB flush count      5, 5, 5
> >
> > So performance wise the results are as much in-between as you'd think
> from a
> > TB size in between. And the memory consumption which (for me) is the
> actual
> > current issue to fix would be back in line again as expected.
>
> So I'm glad you have the workaround.
>
> >
> > So on one hand I'm suggesting something like:
> > --- a/accel/tcg/translate-all.c
> > +++ b/accel/tcg/translate-all.c
> > @@ -944,7 +944,7 @@ static void page_lock_pair(PageDesc **re
> >   * Users running large scale system emulation may want to tweak their
> >   * runtime setup via the tb-size control on the command line.
> >   */
> > -#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (1 * GiB)
> > +#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (256 * MiB)
>
> The problem we have is any number we pick here is arbitrary. And while
> it did regress your use-case changing it again just pushes a performance
> regression onto someone else.


Thanks for your feedback Alex!

That is true "for you", since 5.0 is released from upstream's POV.
But from the downstream POV no 5.0 exists for Ubuntu yet, and I'd break
many places by releasing it like that.
Sadly the performance gain in the other cases will most likely go unnoticed.


> The most (*) 64 bit desktop PCs have 16Gb
> of RAM, almost all have more than 8gb. And there is a workaround.
>

Due to our work around virtualization the values representing
"most 64 bit desktop PCs" aren't the only thing that matters :-)

...


> > This is a bit more tricky than it seems as ram_sizes is no more
> > present in that context but it is enough to discuss it.
> > That should serve all cases - small and large - better as a pure
> > static default of 1G or always ram/4?
>
> I'm definitely against re-introducing ram_size into the mix. The
> original commit (a1b18df9a4) that broke this introduced an ordering
> dependency which we don't want to bring back.
>

I agree with that reasoning, but currently, without any size dependency,
the "arbitrary value" we picked to be 1G is even more fixed than it was
before.
Compared to pre-v5.0, for now I can only decide to
a) tune it down -> performance impact for huge guests
b) keep it at 1G -> functional breakage on small hosts

> I'd be more amenable to something that took into account host memory and
> limited the default if it was smaller than a threshold. Is there a way
> to probe that that doesn't involve slurping /proc/meminfo?
>

I agree that a host-size dependency might be the better way to go,
yet I have no great cross-platform, resilient way to get that value.
Maybe we can make it "if I can get a value, consider it; otherwise use
the current default".
That would already improve many setups, while keeping the rest at the
current behavior.


> >
> > P.S. I added Alex being the Author of the offending patch and
> Richard/Paolo
> > for being listed in the Maintainers file for TCG.
>
>
>
From Zoltan (unifying the thread a bit):

> I agree that this should be dependent on host memory size not guest
> ram_size but it might be tricky to get that value because different host
> OSes would need different ways.

Well - where it isn't available we will continue to take the default
qemu 5.0 already had. If required on other platforms as well they can add
their way of host memory detection into this as needed.

> Maybe a new qemu_host_mem_size portability
> function will be needed that implements this for different host OSes.
> POSIX may or may not have sysconf _SC_PHYS_PAGES and _SC_AVPHYS_PAGES

We should not try to get into the business of _SC_AVPHYS_PAGES and
try to understand/assume what might be cache or otherwise (re)usable.
Since we only look for some alignment with the host size, _SC_PHYS_PAGES
should be good enough, and it is available in more places than the other
options.

> and linux has sysinfo but don't know how reliable these are.

sysconf is slightly more widely available than sysinfo and has enough for
what we need.
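
To illustrate the direction, here is a minimal sketch under the assumptions
discussed here - the helper names and the host-RAM fraction are illustrative
only, not the actual patch:

#include <unistd.h>

/* Illustrative helper: probe host RAM via sysconf(), return 0 if unknown. */
static size_t host_phys_mem(void)
{
#if defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
    long pages = sysconf(_SC_PHYS_PAGES);
    long pagesize = sysconf(_SC_PAGESIZE);
    if (pages > 0 && pagesize > 0) {
        return (size_t)pages * (size_t)pagesize;
    }
#endif
    return 0;
}

/* Illustrative shape of the default selection in size_code_gen_buffer():
 * cap the built-in default by a share of host RAM when the probe works,
 * otherwise keep the current 5.0 behaviour. */
static size_t pick_default_tb_size(size_t tb_size)
{
    if (tb_size == 0) {
        size_t phys = host_phys_mem();
        tb_size = DEFAULT_CODE_GEN_BUFFER_SIZE; /* 1 GiB in translate-all.c */
        if (phys > 0 && phys / 8 < tb_size) {
            tb_size = phys / 8; /* assumed fraction, for illustration only */
        }
    }
    return tb_size;
}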


I have combined the thoughts above into a patch and it works well in
my tests.

32G Host:
pages 8187304.000000
pagesize 4096.000000
max_default 4191899648
final tb_size 1073741824

(qemu) info jit
Translation buffer state:
gen code size       210425059/1073736659
TB count            368273
TB avg target size  20 max=1992 bytes
TB avg host size    330 bytes (expansion ratio: 16.1)
cross page TB count 1656 (0%)
direct jump count   249813 (67%) (2 jumps=182112 49%)
TB hash buckets     197613/262144 (75.38% head buckets used)
TB hash occupancy   34.15% avg chain occ. Histogram: [0,10)%|▆ █ ▅▁▃▁▁|[90,100]%
TB hash avg chain   1.020 buckets. Histogram: 1|█▁▁|3

Statistics:
TB flush count      1
TB invalidate count 451673
TLB full flushes    0
TLB partial flushes 154819
TLB elided flushes  191627
[TCG profiler not compiled]

=> 1G TB size not changed compared to v5.0 - as intended


But on a small 1.5G Host it now works without OOM:

pages 379667.000000
pagesize 4096.000000
max_default 194389504
final tb_size 194389504

(qemu) info jit
Translation buffer state:
gen code size       86179731/194382803
TB count            149995
TB avg target size  20 max=1992 bytes
TB avg host size    333 bytes (expansion ratio: 16.5)
cross page TB count 716 (0%)
direct jump count   98854 (65%) (2 jumps=74962 49%)
TB hash buckets     58842/65536 (89.79% head buckets used)
TB hash occupancy   51.46% avg chain occ. Histogram: [0,10)%|▃ ▇ █▂▆▁▄|[90,100]%
TB hash avg chain   1.091 buckets. Histogram: 1|█▁▁|3

Statistics:
TB flush count      10
TB invalidate count 31733
TLB full flushes    0
TLB partial flushes 180891
TLB elided flushes  244107
[TCG profiler not compiled]

=> ~185M which is way more reasonable given the host size
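(For the record: 379667 pages * 4096 bytes ≈ 1.45 GiB of host RAM, and the
max_default of 194389504 above is exactly one eighth of that - just as
4191899648 is one eighth of the 32G host's RAM - so the new default appears
to be min(host RAM / 8, 1 GiB).)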

The patch will have a rather large comment in it; I'm not sure if the full
comment is needed, but I wanted to leave a trace of what is going on in the
code, and why, for the next person who comes by.

Submitting as a proper patch to the list in a bit ...


-- 
Christian Ehrhardt
Staff Engineer, Ubuntu Server
Canonical Ltd

* Re: TB Cache size grows out of control with qemu 5.0
  2020-07-17 13:40       ` Christian Ehrhardt
@ 2020-07-17 14:27         ` Alex Bennée
  0 siblings, 0 replies; 7+ messages in thread
From: Alex Bennée @ 2020-07-17 14:27 UTC (permalink / raw)
  To: Christian Ehrhardt; +Cc: Paolo Bonzini, Richard Henderson, qemu-devel


Christian Ehrhardt <christian.ehrhardt@canonical.com> writes:

> On Thu, Jul 16, 2020 at 6:27 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>>
>> Christian Ehrhardt <christian.ehrhardt@canonical.com> writes:
>>
>> > On Wed, Jul 15, 2020 at 5:58 PM BALATON Zoltan <balaton@eik.bme.hu>
>> wrote:
>> >
>> >> See commit 47a2def4533a2807e48954abd50b32ecb1aaf29a and the next two
>> >> following it.
>> >>
>> >
>> > Thank you Zoltan for pointing out this commit, I agree that this seems
>> to be
>> > the trigger for the issues I'm seeing. Unfortunately the common CI host
>> size
>> > is 1-2G. For example on Ubuntu Autopkgtests 1.5G.
>> > Those of them running guests do so in 0.5-1G size in TCG mode
>> > (as they often can't rely on having KVM available).
>> >
>> > The 1G TB buffer + 0.5G actual guest size + lack of dynamic downsizing
>> > on memory pressure (never existed) makes these systems go OOM-Killing
>> > the qemu process.
>>
>> Ooops. I admit the assumption was that most people running system
>> emulation would be doing it on beefier machines.
>>
>> > The patches indicated that the TB flushes on a full guest boot are a
>> > good indicator of the TB size efficiency. From my old checks I had:
>> >
>> > - Qemu 4.2 512M guest with 32M default overwritten by ram-size/4
>> > TB flush count      14, 14, 16
>> > - Qemu 5.0 512M guest with 1G default
>> > TB flush count      1, 1, 1
>> >
>> > I agree that ram/4 seems odd, especially on huge guests that is a lot
>> > potentially wasted. And most environments have a bit of breathing
>> > room 1G is too big in small host systems and the common CI system falls
>> > into this category. So I tuned it down to 256M for a test.
>> >
>> > - Qemu 4.2 512M guest with tb-size 256M
>> > TB flush count      5, 5, 5
>> > - Qemu 5.0 512M guest with tb-size 256M
>> > TB flush count      5, 5, 5
>> > - Qemu 5.0 512M guest with 256M default in code
>> > TB flush count      5, 5, 5
>> >
>> > So performance wise the results are as much in-between as you'd think
>> from a
>> > TB size in between. And the memory consumption which (for me) is the
>> actual
>> > current issue to fix would be back in line again as expected.
>>
>> So I'm glad you have the workaround.
>>
>> >
>> > So on one hand I'm suggesting something like:
>> > --- a/accel/tcg/translate-all.c
>> > +++ b/accel/tcg/translate-all.c
>> > @@ -944,7 +944,7 @@ static void page_lock_pair(PageDesc **re
>> >   * Users running large scale system emulation may want to tweak their
>> >   * runtime setup via the tb-size control on the command line.
>> >   */
>> > -#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (1 * GiB)
>> > +#define DEFAULT_CODE_GEN_BUFFER_SIZE_1 (256 * MiB)
>>
>> The problem we have is any number we pick here is arbitrary. And while
>> it did regress your use-case changing it again just pushes a performance
>> regression onto someone else.
>
>
> Thanks for your feedback Alex!
>
> That is true "for you" since 5.0 is released from upstreams POV.
> But from the downstreams POV no 5.0 exists for Ubuntu yet and I'd break
> many places releasing it like that.
> Sadly the performance gain to the other cases will most likely go unnoticed.
>
>
>> The most (*) 64 bit desktop PCs have 16Gb
>> of RAM, almost all have more than 8gb. And there is a workaround.
>>
>
> Due to our work around virtualization the values representing
> "most 64 bit desktop PCs" aren't the only thing that matters :-)
>
> ...
>
>
>> > This is a bit more tricky than it seems as ram_sizes is no more
>> > present in that context but it is enough to discuss it.
>> > That should serve all cases - small and large - better as a pure
>> > static default of 1G or always ram/4?
>>
>> I'm definitely against re-introducing ram_size into the mix. The
>> original commit (a1b18df9a4) that broke this introduced an ordering
>> dependency which we don't want to bring back.
>>
>
> I agree with that reasoning, but currently without any size dependency
> the "arbitrary value" we picked to be 1G is even more fixed than it was
> before.
> Compared to pre v5.0 for now I can only decide to
> a) tune it down -> performance impact for huge guests
> b) keep it at 1G -> functional breakage with small hosts
>
> I'd be more amenable to something that took into account host memory and
>> limited the default if it was smaller than a threshold. Is there a way
>> to probe that that doesn't involve slurping /proc/meminfo?
>>
>
> I agree that a host-size dependency might be the better way to go,
> yet I have no great cross-platform resilient way to get that.
> Maybe we can make it like "if I can get some value consider it,
> otherwise use the current default".
> That would improve many places already, while keeping the rest at the
> current behavior.
>
>
>> >
>> > P.S. I added Alex being the Author of the offending patch and
>> Richard/Paolo
>> > for being listed in the Maintainers file for TCG.
>>
>>
>>
> From Zoltan (unifying the thread a bit):
>
>> I agree that this should be dependent on host memory size not guest
>> ram_size but it might be tricky to get that value because different host
>> OSes would need different ways.
>
> Well - where it isn't available we will continue to take the default
> qemu 5.0 already had. If required on other platforms as well they can add
> their way of host memory detection into this as needed.
>
>> Maybe a new qemu_host_mem_size portability
>> function will be needed that implements this for different host OSes.
>> POSIX may or may not have sysconf _SC_PHYS_PAGES and _SC_AVPHYS_PAGES
>
> We should not try to get into the business of _SC_AVPHYS_PAGES and
> try to understand/assume what might be cache or otherwise (re)usable.
> Since we only look for some alignment to hosts size _SC_PHYS_PAGES should
> be good enough and available in more places than the other options.
>
>> and linux has sysinfo but don't know how reliable these are.
>
> sysconf is slightly more widely available than sysinfo and has enough for
> what we need.
>
>
> I have combined the thoughts above into a patch and it works well in
> my tests.
>
> 32G Host:
> pages 8187304.000000
> pagesize 4096.000000
> max_default 4191899648
> final tb_size 1073741824
>
> (qemu) info jit
> Translation buffer state:
> gen code size       210425059/1073736659
> TB count            368273
> TB avg target size  20 max=1992 bytes
> TB avg host size    330 bytes (expansion ratio: 16.1)
> cross page TB count 1656 (0%)
> direct jump count   249813 (67%) (2 jumps=182112 49%)
> TB hash buckets     197613/262144 (75.38% head buckets used)
> TB hash occupancy   34.15% avg chain occ. Histogram: [0,10)%|▆ █ ▅▁▃▁▁|[90,100]%
> TB hash avg chain   1.020 buckets. Histogram: 1|█▁▁|3
>
> Statistics:
> TB flush count      1
> TB invalidate count 451673
> TLB full flushes    0
> TLB partial flushes 154819
> TLB elided flushes  191627
> [TCG profiler not compiled]
>
> => 1G TB size not changed compared to v5.0 - as intended
>
>
> But on a small 1.5G Host it now works without OOM:
>
> pages 379667.000000
> pagesize 4096.000000
> max_default 194389504
> final tb_size 194389504
>
> (qemu) info jit
> Translation buffer state:
> gen code size       86179731/194382803
> TB count            149995
> TB avg target size  20 max=1992 bytes
> TB avg host size    333 bytes (expansion ratio: 16.5)
> cross page TB count 716 (0%)
> direct jump count   98854 (65%) (2 jumps=74962 49%)
> TB hash buckets     58842/65536 (89.79% head buckets used)
> TB hash occupancy   51.46% avg chain occ. Histogram: [0,10)%|▃ ▇ █▂▆▁▄|[90,100]%
> TB hash avg chain   1.091 buckets. Histogram: 1|█▁▁|3
>
> Statistics:
> TB flush count      10
> TB invalidate count 31733
> TLB full flushes    0
> TLB partial flushes 180891
> TLB elided flushes  244107
> [TCG profiler not compiled]
>
> => ~185M which is way more reasonable given the host size
>
> The patch will have a rather large comment in it, I'm not sure if the full
> comment is needed, but I wanted to leave a trace what&why is going
> on in the code for the next one who comes by.
>
> Submitting as a proper patch to the list in a bit ...

Ahh, did you not get CC'ed on my patch this morning?

-- 
Alex Bennée

