From: Doug Smythies <dsmythies@telus.net>
To: Zhang Rui <rui.zhang@intel.com>
Cc: daniel.lezcano@linaro.org, srinivas.pandruvada@linux.intel.com,
linux-pm@vger.kernel.org
Subject: Re: [PATCH] thermal/intel: introduce tcc cooling driver
Date: Sat, 30 Jan 2021 08:58:02 -0800
Message-ID: <CAAYoRsXcPSJ5N-n6FC31kRkewJYEXC9PSk+r7gnMrt40ppUDAQ@mail.gmail.com>
In-Reply-To: <53074020f2b19a38811eec925457e828581658f3.camel@intel.com>
On Thu, Jan 28, 2021 at 9:30 AM Zhang Rui <rui.zhang@intel.com> wrote:
> On Tue, 2021-01-26 at 11:18 -0800, Doug Smythies wrote:
> > On 2021.01.16 09:08 Doug Smythies wrote:
> > > On 2021.01.15 Zhang Rui wrote:
...
> > They should have been: RATL and RATLL.
> >
> > From the proper page of the book:
> >
> > > Running Average Thermal Limit Status (R0)
> > > When set, frequency is reduced below the operating
> > > system request due to Running Average Thermal Limit
> > > (RATL).
> >
>
> > 2.) Due to the already discussed turbostat issue, that was not
> > the actual temperature and so the RATL bit being set was actually
> > valid at that time.
> >
> On my side, I got the "Thermal status bit" set.
Yes, and if I understand your comment correctly, you are referring
to IA32_THERM_STATUS (0X19C) and/or
IA32_PACKAGE_THERM_STATUS (0X1B1). I am referring to
MSR_CORE_PERF_LIMIT_REASONS (0X64F).
>
> > I have not been able to find the time window knob for this, if there
> > even is one, similar to the time window knobs for the package power
> > limits.
I just assume there is a time window, similar to the RAPL based
power limits. But I haven't found it.
> > I wanted to reduce the time constant, just as a test, in an attempt
> > to reduce the step function load potential temperature overshoot.
...
> >
> Thanks for your test.
> I'd prefer this is platform specific.
> Because it behaves really differently from what I observed.
O.K. These oddities aside, in the end it does do
the expected job.
> 99.06 2036 14195 66 12.27
> 99.07 2007 14240 66 12.07
> 99.12 2888 12147 98 28.23 <<< offset cleared
> 99.03 3413 11503 98 37.21
> 98.96 3317 11698 98 34.64
very close to critical temp.
I never knowingly allow my processor
to go above 80 degrees.
Although, I admit it hit 90 degrees a couple of
times during this work.
> 99.07 3246 11410 98 32.89
> 98.95 3210 12107 98 32.13
> 98.94 3164 11790 98 31.08
> 99.00 3124 12106 98 30.84
> 99.00 3086 11876 98 29.60
> 98.94 3054 12482 98 29.00
> 98.89 3030 12629 98 28.54
> 99.39 2377 10764 82 17.62 <<< Didn't do anything, so it
> is probably thermald or something
or critical temp hit.
>
> I tried both tests, and the results are the same, in both cases, it
> starts throttling immediately (within a second), and no over-throttling
> observed.
>
> Do you have a script to do this?
No, all of my tests were done manually, varying:
. placement of high loads on some cores for more heat over a smaller surface area.
. balance between 100% CPU load at max heat versus 100% CPU load at less heat.
. balance between this TCC Offset throttling versus package power limits.
. using ambient (coolant temperature) as a heat removal capacity knob.
In summary: I played around until I found something interesting.
> Say, run turbostat in background and
> then change tcc offset at certain timestamp? Maybe we can try exactly
> the same test on different machines.
I had an idea, and wasted way way too much time trying to make it work.
I thought to just get turbostat to also show the offset, so then we know for
certain when it changed. I tried virtually all combinations of:
  turbostat --Summary --quiet \
    --add /sys/devices/virtual/thermal/cooling_device11/cur_state,,,,TCC \
    --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
  turbostat --Summary --quiet --add msr0x1a2,u32,package,raw,TCC \
    --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
and could never get it to work in "Summary" mode. (note: about 95% of
my use of turbostat is in "Summary" mode.)
Anyway, after too long, I did get this to work:
  turbostat --quiet --cpu 0 \
    --add /sys/devices/virtual/thermal/cooling_device11/cur_state,u32,,raw,TCC \
    --show CPU,Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1 | grep "^ 0"
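The scripted test suggested in the quoted text (change the TCC offset at a known time while turbostat logs in the background) could be sketched roughly as follows. This is only a sketch under assumptions: the function name run_schedule and its parameters are made up for illustration, the cooling_device11 path is the one from the command above and will differ on other machines, and writing to it needs root.

```python
import time
from pathlib import Path

# Path taken from the turbostat command above; the cooling device
# index (11) is specific to that machine and is an assumption elsewhere.
CUR_STATE = "/sys/devices/virtual/thermal/cooling_device11/cur_state"


def run_schedule(schedule, path=CUR_STATE, sleep=time.sleep):
    """Write each TCC offset to `path` at its scheduled time.

    `schedule` is a list of (seconds_from_start, offset) pairs.
    Returns the list of (elapsed_seconds, offset) actually applied,
    so the times can be matched against the turbostat log afterwards.
    """
    start = time.monotonic()
    applied = []
    for delay, offset in sorted(schedule):
        remaining = delay - (time.monotonic() - start)
        if remaining > 0:
            sleep(remaining)
        # The sysfs file takes the new state as a decimal string.
        Path(path).write_text(f"{offset}\n")
        applied.append((round(time.monotonic() - start, 1), offset))
    return applied


if __name__ == "__main__":
    # Hypothetical schedule: offset 1 at start, offset 30 after 60 seconds.
    for elapsed, offset in run_schedule([(0, 1), (60, 30)]):
        print(f"t={elapsed}s offset={offset}")
```

With turbostat already logging at a 1 second interval in another terminal, the printed times give the rows at which to expect the TCC column to change.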
Example 1:
  turbostat --quiet --cpu 0 \
    --add /sys/devices/virtual/thermal/cooling_device11/cur_state,u32,,raw,TCC \
    --show CPU,Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1 | grep "^0"
CPU  Busy%   Bzy_MHz  IRQ   TCC         PkgTmp  PkgWatt
0    100.26  4500     1002  0x00000001  78      99.88    <<< Offset = 1
0    100.26  4501     1002  0x00000001  77      99.90    <<< steady state power limit throttle
0    100.26  4501     1004  0x00000001  77      99.92
0    100.26  4500     1002  0x0000001e  78      99.91    <<< offset changed, trip int 70
0    100.25  4502     1003  0x0000001e  77      100.03
0    100.25  4503     1002  0x0000001e  77      99.85
0    100.25  4502     1002  0x0000001e  78      99.92
0    100.26  4501     1003  0x0000001e  78      99.95
0    100.25  4503     1002  0x0000001e  77      99.88
0    100.25  4502     1002  0x0000001e  78      99.86
0    100.25  4502     1004  0x0000001e  77      99.92
0    100.25  4503     1002  0x0000001e  77      99.98
0    100.25  4502     1002  0x0000001e  77      99.88
0    100.26  4498     1004  0x0000001e  77      100.06
0    100.26  4501     1002  0x0000001e  78      99.77
0    100.26  4500     1002  0x0000001e  78      99.53
0    100.26  4430     1002  0x0000001e  72      91.19    <<< Thermal throttling. 13 Seconds
0    100.26  4400     1002  0x0000001e  72      87.55
0    100.26  4400     1002  0x0000001e  71      87.52
0    100.26  4400     1005  0x0000001e  71      87.56
0    100.26  4400     1002  0x0000001e  72      87.53
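The delay between the offset change and the onset of throttling can also be pulled out of a log like Example 1 mechanically rather than by counting rows. A minimal sketch, assuming the 7-column layout shown above (CPU, Busy%, Bzy_MHz, IRQ, TCC, PkgTmp, PkgWatt), one row per second, and a 2% package power drop as the (arbitrary) definition of "throttling started"; the function name throttle_delay is made up for illustration:

```python
def throttle_delay(lines, drop=0.02):
    """Seconds from the TCC column changing to PkgWatt dropping.

    Returns None if the TCC column never changes or power never
    falls more than `drop` (a fraction) below the last reading
    taken before the change.
    """
    rows = []
    for line in lines:
        # Discard any "<<<" annotation, then split into columns.
        fields = line.split("<<<")[0].split()
        # Keep data rows only: 7 columns, the first a CPU number.
        if len(fields) == 7 and fields[0].isdigit():
            rows.append((int(fields[4], 16), float(fields[6])))

    # Index of the first row whose TCC value differs from the previous row.
    change = next(
        (i for i in range(1, len(rows)) if rows[i][0] != rows[i - 1][0]),
        None,
    )
    if change is None:
        return None

    baseline = rows[change - 1][1]  # PkgWatt just before the change
    for i in range(change, len(rows)):
        if rows[i][1] < baseline * (1 - drop):
            return i - change  # one row per second with --interval 1
    return None
```

Applied to the Example 1 rows this gives 13, in agreement with the annotation; a log where power falls in the same row as the offset change gives 0.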
Example 2:
0    100.26  4600     1002  0x00000000  83      113.26   <<< Offset = 0
0    100.26  4600     1002  0x00000000  84      113.43
0    100.25  4599     1002  0x00000000  83      113.42   <<< No power limit throttle yet.
0    100.26  4600     1004  0x00000000  83      113.40   <<< Not steady state.
0    100.26  4600     1002  0x00000000  83      113.25
0    100.25  3797     1003  0x00000018  56      54.11    <<< Overshoot is immediate.
0    100.26  3700     1002  0x00000018  56      47.09
0    100.26  3700     1002  0x00000018  55      47.08
0    100.26  3700     1002  0x00000018  54      46.98
0    100.26  3820     1002  0x00000018  58      51.62    <<< starts to recover.
0    100.26  4016     1002  0x00000018  62      61.55
0    100.26  4177     1002  0x00000018  64      69.91
0    100.26  4275     1004  0x00000018  68      75.81
0    100.26  4300     1002  0x00000018  68      77.36
0    100.26  4371     1002  0x00000018  71      84.53
0    100.26  4400     1002  0x00000018  72      87.52
0    100.26  4400     1003  0x00000018  72      87.62
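For reading the TCC column in Examples 1 and 2: it is the raw cooling device state in hex, i.e. the TCC offset in degrees. A small sketch of the conversion to a nominal trip temperature follows. The TjMax of 100 is an assumption, inferred only from the Example 1 annotation that pairs offset 0x1e with 70; the function names are made up for illustration.

```python
TJMAX_C = 100  # assumption: inferred from offset 0x1e being annotated with 70


def nominal_trip(tcc_hex, tjmax=TJMAX_C):
    """Nominal trip temperature for a raw hex TCC column value."""
    return tjmax - int(tcc_hex, 16)


def offset_for_trip(trip_c, tjmax=TJMAX_C):
    """TCC offset that would give the requested nominal trip temperature."""
    return tjmax - trip_c


print(nominal_trip("0x0000001e"))  # 70
print(nominal_trip("0x00000018"))  # 76
print(offset_for_trip(75))         # 25
```

Because of the turbostat temperature issue discussed earlier in the thread, the PkgTmp column may not line up with these nominal values.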
Example 3:
This test is specifically an attempt to test the TCC Offset in the exact
way I intend to use it. Trip point = 75 degrees, and never changes.
Power limit 2 is 115 watts, timing window short.
Power limit 1 is 100 watts, timing window 8 seconds.
Note: all previous work was with the timing window at 28 seconds.
Note: typically temperature < 75 at 100 watts.
The load is 4 prime95 maximum heat threads, plus 0 weaker memory
hammering threads.
The coolant had to be preheated for about an hour before this test
started, otherwise the processor would not get hot enough before
package power limit 1 took over the throttling duties.
Now, watching the TCC offset is useless for this test, so let's watch
MSR_CORE_PERF_LIMIT_REASONS instead:
  turbostat --add msr0x64f,u32,,raw,TCC \
    --show CPU,Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ,RAMWatt --interval 1 | grep "^0"
(O.K., I should have changed the name of the added column. The grep
filters out the header line anyhow; I added it back manually and edited it.)
CPU  Busy%   Bzy_MHz  IRQ   TCC         PkgTmp  PkgWatt  RAMWatt
0    0.07    1081     5     0x08200000  38      2.31     0.45     <<< Note high idle start temp.
0    0.16    824      11    0x08200000  38      2.12     0.45
0    1.74    3430     44    0x00000000  38      2.65     0.45     <<< clear last time's log bits
0    0.16    851      6     0x00000000  37      2.27     0.45
0    4.32    3313     269   0x00000000  75      47.15    0.45     <<< load applied
0    4.24    4585     458   0x08000800  78      97.16    0.45     <<< package power limit 2
0    2.80    4588     482   0x08000000  77      97.49    0.45     <<< temperature just high
0    2.87    4593     463   0x08000000  78      97.95    0.45
0    3.39    4600     465   0x08000000  78      97.68    0.45
0    2.66    4600     462   0x08000000  78      97.55    0.45
0    2.28    4584     490   0x08000000  78      97.97    0.45
0    3.29    4583     478   0x08000000  78      97.72    0.45
0    3.24    4595     465   0x08000000  77      97.52    0.45
0    2.47    4600     465   0x08000000  78      97.50    0.45
0    4.18    4570     464   0x08000000  78      97.72    0.45
0    2.51    4600     470   0x08000000  78      97.40    0.45
0    1.77    4601     482   0x08000000  78      97.33    0.45
0    3.13    4584     462   0x08000000  78      97.57    0.45
0    3.06    4600     466   0x08000000  78      97.77    0.45
0    2.86    4592     461   0x08000000  78      97.56    0.45
0    2.85    4569     486   0x08000000  78      97.99    0.45
0    2.96    4600     465   0x08000000  78      97.91    0.45
0    3.00    4585     451   0x08000000  78      97.68    0.45
0    2.06    4600     475   0x08000000  78      97.50    0.45
0    3.05    4594     462   0x08000000  78      97.78    0.45
0    3.11    4592     461   0x08000000  78      97.68    0.45
0    2.31    4546     463   0x08200020  73      93.00    0.45     <<< RATL
0    2.80    4525     454   0x08200000  78      91.29    0.45     <<< Oscillates within
0    3.32    4538     445   0x08200020  73      91.61    0.45     <<< 1 pstate
0    3.27    4557     434   0x08200000  78      93.12    0.45
0    3.26    4523     470   0x08200020  73      89.85    0.45     <<< rough estimate is
0    2.48    4586     466   0x08200020  74      95.67    0.45     <<< oscillation costs 0.4%
0    1.95    4521     468   0x08200000  76      87.93    0.45     <<< performance loss versus
0    3.28    4569     449   0x08200020  73      94.67    0.45     <<< the power limit 2 servo.
0    0.44    4546     495   0x08200000  78      91.77    0.45     <<< (very crude, hard to defend
0    1.91    4518     487   0x08200020  73      91.24    0.45     <<< data.)
0    3.25    4539     460   0x08200000  78      91.63    0.45
0    2.51    4546     469   0x08200020  74      91.12    0.45
0    3.60    4540     453   0x08200000  77      91.43    0.45
0    3.06    4542     463   0x08200020  73      91.56    0.45
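The raw MSR_CORE_PERF_LIMIT_REASONS (0x64f) values in the TCC column above can be decoded with a few lines of code. This is a sketch, not a complete decoder: only a handful of status bits are listed, their positions are as recalled from the Intel SDM and should be checked against the SDM for the processor in question, and the names decode_limit_reasons and STATUS_BITS are made up for illustration. The two bits that matter here (RATL and power limit 2) are consistent with the annotations in the table.

```python
# Status bit positions: assumptions to be checked against the Intel SDM.
# Each status bit has a matching sticky "log" bit 16 positions higher.
STATUS_BITS = {
    0: "PROCHOT",
    1: "Thermal",
    5: "RATL",
    10: "PL1",
    11: "PL2",
}


def decode_limit_reasons(value):
    """Return (active_status_names, logged_names) for a 0x64f value.

    `value` may be an int or a hex string such as "0x08200020".
    """
    if isinstance(value, str):
        value = int(value, 16)
    status = [name for bit, name in STATUS_BITS.items() if (value >> bit) & 1]
    log = [name for bit, name in STATUS_BITS.items() if (value >> (bit + 16)) & 1]
    return status, log


for raw in ("0x08000800", "0x08000000", "0x08200020", "0x08200000"):
    status, log = decode_limit_reasons(raw)
    print(raw, "status:", status, "log:", log)
```

Under these assumptions 0x08000800 decodes as PL2 active and logged, and 0x08200020 as RATL active with both RATL and PL2 logged, which is the alternation between 0x08200020 and 0x08200000 seen once RATL takes over.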
... Doug