* Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
@ 2024-02-17 13:01 Roman Benes
  2024-02-17 13:30 ` Greg KH
  2024-02-20 11:20 ` Linux regression tracking #adding (Thorsten Leemhuis)
  0 siblings, 2 replies; 28+ messages in thread
From: Roman Benes @ 2024-02-17 13:01 UTC (permalink / raw)
  To: regressions

The minimum power limit on the latest (6.7+) kernels is 190W for my GPU (RX 
6700XT, mesa, Arch Linux), and I can no longer set the power cap as low as 
before (115W) with CoreCtrl, LACT or TuxClocker; the corresponding /sys 
attribute is read-only even for root. This is a kernel issue, not an issue 
in those apps: I have read similar reports in their bug trackers. 
After downgrading to the v6.6.10 kernel, my 115W (under-power) cap works 
again as before.
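
For readers who want to check what range their own card reports, here is a minimal sketch in C. It assumes the limits are exposed in microwatts through the amdgpu hwmon attributes power1_cap_min and power1_cap_max; the exact directory (for example /sys/class/drm/card0/device/hwmon/hwmon*/) varies per system and is an assumption here, as are the helper names.

```c
#include <stdio.h>

/* Read one integer value (microwatts) from a hwmon attribute file.
 * The path is supplied by the caller; the card and hwmon indices
 * differ between systems. Returns 0 on success, -1 on failure. */
int read_microwatts(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Return 1 if the requested cap lies inside [min_uw, max_uw], else 0.
 * This mirrors the kind of range check that rejects 115W when the
 * reported minimum is 190W. */
int cap_in_range(long requested_uw, long min_uw, long max_uw)
{
    return requested_uw >= min_uw && requested_uw <= max_uw;
}
```

A caller would read power1_cap_min and power1_cap_max with read_microwatts() and pass them to cap_in_range() before attempting to write power1_cap.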

Please bring the low range back, as the efficiency gain is 
significant: roughly 90W less for a performance loss of under 10%. (Links 
were not allowed; I posted about it in reddit's Linux sub and I am going 
to report it on gitlab's drm/amd as well.)
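
To put the tradeoff in numbers, here is a small worked sketch in C. The 115W cap and the 90W saving come from this report; the 205W baseline (115W + 90W) is inferred from them, and the 10% loss is the upper bound stated above, so the result is only a rough lower estimate. The function name is made up for illustration.

```c
/* Ratio of performance-per-watt after capping versus before capping.
 * perf_loss is the fraction of performance lost (0.10 for 10%). */
double perf_per_watt_ratio(double base_watts, double capped_watts,
                           double perf_loss)
{
    double before = 1.0 / base_watts;                /* baseline perf = 1.0 */
    double after = (1.0 - perf_loss) / capped_watts; /* capped perf per watt */

    return after / before;
}
```

With 205W, 115W and a 10% loss this works out to 0.9 * 205 / 115, or about 1.6 times the performance per watt.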

With regards, Romano




* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-17 13:01 Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu) Roman Benes
@ 2024-02-17 13:30 ` Greg KH
  2024-02-19 11:15   ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-20 11:20 ` Linux regression tracking #adding (Thorsten Leemhuis)
  1 sibling, 1 reply; 28+ messages in thread
From: Greg KH @ 2024-02-17 13:30 UTC (permalink / raw)
  To: Roman Benes; +Cc: regressions

On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
> even for root. This is not of above apps issue but of the kernel, I read
> similar issues from other bug reports of above apps. I downgraded to v6.6.10
> kernel and my 115W(under power)cap work again as before.

Any chance you can use 'git bisect' to figure out the offending change?

thanks,

greg k-h


* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-17 13:30 ` Greg KH
@ 2024-02-19 11:15   ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-19 11:31     ` Roman Benes
                       ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-19 11:15 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Pan, Xinhui
  Cc: regressions, Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter,
	Greg KH, Roman Benes

On 17.02.24 14:30, Greg KH wrote:
> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
>> even for root. This is not of above apps issue but of the kernel, I read
>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
>> kernel and my 115W(under power)cap work again as before.
> 
> Any chance you can use 'git bisect' to figure out the offending change?

For the record and everyone that lands here: the cause is known now
(it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
value") [v6.7-rc1]) and the issue is afaics tracked here:

https://gitlab.freedesktop.org/drm/amd/-/issues/3183

Other mentions:
https://gitlab.freedesktop.org/drm/amd/-/issues/3137
https://gitlab.freedesktop.org/drm/amd/-/issues/2992

Haven't seen any statement from the amdgpu developers (now CCed) yet on
this there (but might have missed something!). From what I can see I
assume this will likely be somewhat tricky to handle, as a revert
overall might be a bad idea here. We'll see I guess.

Roman posted something that apparently was meant to go to the list, so
let me put it here:

"""
UPDATE: User fililip already posted a patch, but it needs to be merged;
the discussion is at the gitlab link below.

(PS: I hope I am replying correctly to "all" now? - using original addr.)


> it seems that commit was already found(see user's 'fililip' comment):
>
> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
> commit 1958946858a62b6b5392ed075aa219d199bcae39
> Author: Ma Jun <Jun.Ma2@amd.com>
> Date:   Thu Oct 12 09:33:45 2023 +0800
>
>     drm/amd/pm: Support for getting power1_cap_min value
>
>     Support for getting power1_cap_min value on smu13 and smu11.
>     For other Asics, we still use 0 as the default value.
>
>     Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
>     Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
>     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>
> However, this is not good, as it cuts the under-powering range too far. I
> was getting only about 7% less performance but 90W(!) less consumption
> when set to my 115W before. Also, I wonder whether we, as an OS of options
> and freedom, have to stick to such a very high reference for min values
> without the ability to override them through some sys ctrls. The commit was
> done by an AMD guy, and I wonder if it was maybe because of this post that
> I made a few months ago (business strategy?):
>
> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
>
> This is not a dangerous upward OC, where I can understand the desire to
> protect HW; it is downward. Having the min cap at 190W when the card runs
> at almost the same speed on 115W is IMO crazy to deny. We are not talking
> about default or reference values here either, just a move to narrow the
> range of options for whatever reason.
>
> I don't know how much power you guys have over them, but please
> consider either reverting this change or giving us an option to set
> min_cap through, say, /sys (right now the param is read-only, even for root).
>
>
> Thank you in advance for looking into this, with regards:  Romano
"""

And while at it, let me add this issue to the tracking as well

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot introduced 1958946858a62b /
#regzbot title drm: amdgpu: under-powering broke

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-19 11:15   ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-19 11:31     ` Roman Benes
  2024-02-19 11:35     ` Romano
  2024-02-20 14:45     ` Alex Deucher
  2 siblings, 0 replies; 28+ messages in thread
From: Roman Benes @ 2024-02-19 11:31 UTC (permalink / raw)
  To: Linux regressions mailing list, Alex Deucher,
	Christian König, Pan, Xinhui
  Cc: Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter, Greg KH


Hello everyone,

a patch by user @fililip was posted there, but not submitted:

"I think I'd have to submit it to the linux kernel mailing list, which 
I am kinda scared of 😅. It could be better to submit that patch to Arch 
Linux maintainers; they could include it in their kernel builds."

The implementation of this patch can be simplified by setting:

smu->min_power_limit = amdgpu_ignore_min_pcap ? 0 : whatever_default_smuxx;

and then leaving the rest of the code unchanged (except for defining the 
amdgpu_ignore_min_pcap variable, of course). I hope nothing tricky and no 
revert should be needed. Please add it to the general kernel as an option; 
it certainly should not be specific to Arch Linux.

Roman




* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-19 11:15   ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-19 11:31     ` Roman Benes
@ 2024-02-19 11:35     ` Romano
  2024-02-20 14:45     ` Alex Deucher
  2 siblings, 0 replies; 28+ messages in thread
From: Romano @ 2024-02-19 11:35 UTC (permalink / raw)
  To: Linux regressions mailing list, Alex Deucher,
	Christian König, Pan, Xinhui
  Cc: Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter, Greg KH

Hello everyone,

a patch by user @fililip was posted there, but not submitted:

"I think I'd have to submit it to the linux kernel mailing list, which I 
am kinda scared of 😅. It could be better to submit that patch to Arch 
Linux maintainers; they could include it in their kernel builds."

The implementation of this patch can be simplified by setting:

smu->min_power_limit = amdgpu_ignore_min_pcap ? 0 : whatever_default_smuxx;

and then leaving the rest of the code unchanged (except for defining the 
amdgpu_ignore_min_pcap variable, of course). I hope nothing tricky and no 
revert should be needed. Please add it to the general kernel as an option; 
it certainly should not be specific to Arch Linux.
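
The selection logic suggested above can be modeled in a few lines of standalone C. This is only a userspace sketch: amdgpu_ignore_min_pcap is the opt-in switch proposed in this mail and is not an existing amdgpu module parameter, and the function name and the watt values in the comments are illustrative assumptions.

```c
#include <stdint.h>

/* Pick the minimum power limit the driver would enforce.
 * ignore_min_pcap models the proposed (hypothetical) opt-in switch;
 * board_min_watts stands in for the per-board validated minimum,
 * e.g. 190 for the card discussed in this thread. */
uint32_t effective_min_power_limit(int ignore_min_pcap,
                                   uint32_t board_min_watts)
{
    /* With the switch set, fall back to 0, the default that the
     * commit message says is still used for other ASICs. */
    return ignore_min_pcap ? 0 : board_min_watts;
}
```

By default the validated minimum stays in force; only users who explicitly set the switch would get the old 0 lower bound back.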

Roman





* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-17 13:01 Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu) Roman Benes
  2024-02-17 13:30 ` Greg KH
@ 2024-02-20 11:20 ` Linux regression tracking #adding (Thorsten Leemhuis)
  1 sibling, 0 replies; 28+ messages in thread
From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2024-02-20 11:20 UTC (permalink / raw)
  To: regressions

On 17.02.24 14:01, Roman Benes wrote:
> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX
> 6700XT, mesa, archlinux) and I cannot get power cap as low as before(to
> 115W), neither with Corectrl, LACT or TuxClocker and /sys have a
> variable read-only even for root. This is not of above apps issue but of
> the kernel, I read similar issues from other bug reports of above apps.
> I downgraded to v6.6.10 kernel and my 115W(under power)cap work again as
> before.
> 
> Please bring the low range back as efficiency vs power consumption is
> significant(links were not allowed, I posted in reddit's Linux sub about
> it and I am going to gitlab's drm/amd as well, but its like 90W and <
> 10% performance tradeoff).

#regzbot ^introduced 1958946858a62b
#regzbot link: https://gitlab.freedesktop.org/drm/amd/-/issues/3183
#regzbot link: https://gitlab.freedesktop.org/drm/amd/-/issues/3137
#regzbot title drm: amdgpu: under-powering broke

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-19 11:15   ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-19 11:31     ` Roman Benes
  2024-02-19 11:35     ` Romano
@ 2024-02-20 14:45     ` Alex Deucher
  2024-02-20 15:03       ` Linux regression tracking (Thorsten Leemhuis)
  2 siblings, 1 reply; 28+ messages in thread
From: Alex Deucher @ 2024-02-20 14:45 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: Alex Deucher, Christian König, Pan, Xinhui, Ma Jun, amd-gfx,
	Dave Airlie, Daniel Vetter, Greg KH, Roman Benes

On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
Leemhuis) <regressions@leemhuis.info> wrote:
>
> On 17.02.24 14:30, Greg KH wrote:
> > On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
> >> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
> >> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
> >> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
> >> even for root. This is not of above apps issue but of the kernel, I read
> >> similar issues from other bug reports of above apps. I downgraded to v6.6.10
> >> kernel and my 115W(under power)cap work again as before.
> >
> > Any chance you can use 'git bisect' to figure out the offending change?
>
> For the record and everyone that lands here: the cause is known now
> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
> value") [v6.7-rc1]) and the issue afaics tracked here:
>
> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>
> Other mentions:
> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
>
> Haven't seen any statement from the amdgpu developers (now CCed) yet on
> this there (but might have missed something!). From what I can see I
> assume this will likely be somewhat tricky to handle, as a revert
> overall might be a bad idea here. We'll see I guess.

The change aligns the driver with what has been validated on each board
design.  Windows uses the same limits.  Using values lower than the
validated range can lead to undefined behavior and could potentially
damage your hardware.

Alex



* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 14:45     ` Alex Deucher
@ 2024-02-20 15:03       ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-20 15:15         ` Alex Deucher
  0 siblings, 1 reply; 28+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-20 15:03 UTC (permalink / raw)
  To: Alex Deucher, Linux regressions mailing list
  Cc: Alex Deucher, Christian König, Pan, Xinhui, Ma Jun, amd-gfx,
	Dave Airlie, Daniel Vetter, Greg KH, Roman Benes

On 20.02.24 15:45, Alex Deucher wrote:
> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
>>
>> On 17.02.24 14:30, Greg KH wrote:
>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
>>>> even for root. This is not of above apps issue but of the kernel, I read
>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
>>>> kernel and my 115W(under power)cap work again as before.
>>>
>> For the record and everyone that lands here: the cause is known now
>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
>> value") [v6.7-rc1]) and the issue afaics tracked here:
>>
>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>
>> Other mentions:
>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
>>
>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
>> this there (but might have missed something!). From what I can see I
>> assume this will likely be somewhat tricky to handle, as a revert
>> overall might be a bad idea here. We'll see I guess.
> 
> The change aligns the driver with what has been validated on each board
> design.  Windows uses the same limits.  Using values lower than the
> validated range can lead to undefined behavior and could potentially
> damage your hardware.

Thx for the reply! Yeah, I was expecting something along those lines.

Nevertheless it afaics still is a regression in the eyes of many users.
I'm not sure how Linus feels about this, but I wonder if we can find
some solution here so that users that really want to, can continue to do
what was possible out-of-the box before. Is that possible to realize or
even supported already?

And sure, those users would be running their hardware outside of its
specifications. But is that different from overclocking (which the
driver allows, doesn't it? If not by all means please correct me!)?

Ciao, Thorsten



* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 15:03       ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-20 15:15         ` Alex Deucher
  2024-02-20 15:26           ` Christian König
  2024-02-20 15:27           ` Hans de Goede
  0 siblings, 2 replies; 28+ messages in thread
From: Alex Deucher @ 2024-02-20 15:15 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: Alex Deucher, Christian König, Pan, Xinhui, Ma Jun, amd-gfx,
	Dave Airlie, Daniel Vetter, Greg KH, Roman Benes

On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
Leemhuis) <regressions@leemhuis.info> wrote:
>
> On 20.02.24 15:45, Alex Deucher wrote:
> > On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
> > Leemhuis) <regressions@leemhuis.info> wrote:
> >>
> >> On 17.02.24 14:30, Greg KH wrote:
> >>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
> >>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
> >>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
> >>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
> >>>> even for root. This is not of above apps issue but of the kernel, I read
> >>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
> >>>> kernel and my 115W(under power)cap work again as before.
> >>>
> >> For the record and everyone that lands here: the cause is known now
> >> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
> >> value") [v6.7-rc1]) and the issue afaics tracked here:
> >>
> >> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
> >>
> >> Other mentions:
> >> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
> >> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
> >>
> >> Haven't seen any statement from the amdgpu developers (now CCed) yet on
> >> this there (but might have missed something!). From what I can see I
> >> assume this will likely be somewhat tricky to handle, as a revert
> >> overall might be a bad idea here. We'll see I guess.
> >
> > The change aligns the driver what has been validated on each board
> > design.  Windows uses the same limits.  Using values lower than the
> > validated range can lead to undefined behavior and could potentially
> > damage your hardware.
>
> Thx for the reply! Yeah, I was expecting something along those lines.
>
> Nevertheless it afaics still is a regression in the eyes of many users.
> I'm not sure how Linus feels about this, but I wonder if we can find
> some solution here so that users that really want to, can continue to do
> what was possible out-of-the box before. Is that possible to realize or
> even supported already?
>
> And sure, those users would be running their hardware outside of its
> specifications. But is that different from overclocking (which the
> driver allows, doesn't it? If not by all means please correct me!)?

Sure.  The driver has always had upper bound limits for overclocking;
this change adds lower bounds checking for underclocking as well.
When the silicon validation teams set the bounding box for a device,
they set a range of values where it's reasonable to operate based on
the characteristics of the design.

If we did want to allow extended underclocking, we would need a big
warning in the logs at the very least.
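
To illustrate the bounds check described above, here is a minimal
userspace sketch. It is not the driver's code. It assumes the amdgpu
hwmon attributes power1_cap_min and power1_cap_max, which report
microwatts; the helper names are made up for illustration.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def read_microwatts(hwmon_dir, name):
    """Read one hwmon attribute file; the value is an integer in microwatts."""
    return int(Path(hwmon_dir, name).read_text().strip())


def cap_is_in_range(hwmon_dir, requested_watts):
    """Return True if the requested cap lies inside the advertised range.

    Hypothetical helper: it only mirrors the idea of rejecting values
    below power1_cap_min or above power1_cap_max.
    """
    cap_min = read_microwatts(hwmon_dir, "power1_cap_min")
    cap_max = read_microwatts(hwmon_dir, "power1_cap_max")
    requested = round(requested_watts * MICROWATTS_PER_WATT)
    return cap_min <= requested <= cap_max
```

The hwmon directory typically lives somewhere under
/sys/class/drm/card0/device/hwmon/, but the exact path varies per
system. With the 190W minimum from the report, a request for 115W
would presumably be rejected by such a check.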

Alex

>
> Ciao, Thorsten
>
> >> Roman posted something that apparently was meant to go to the list, so
> >> let me put it here:
> >>
> >> """
> >> UPDATE: User fililip already posted patch, but it need to be merged,
> >> discussion is on gitlab link below.
> >>
> >> (PS: I hope I am replying correctly to "all" now? - using original addr.)
> >>
> >>
> >>> it seems that commit was already found(see user's 'fililip' comment):
> >>>
> >>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
> >>> commit 1958946858a62b6b5392ed075aa219d199bcae39
> >>> Author: Ma Jun <Jun.Ma2@amd.com>
> >>> Date:   Thu Oct 12 09:33:45 2023 +0800
> >>>
> >>>     drm/amd/pm: Support for getting power1_cap_min value
> >>>
> >>>     Support for getting power1_cap_min value on smu13 and smu11.
> >>>     For other Asics, we still use 0 as the default value.
> >>>
> >>>     Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
> >>>     Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
> >>>     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> >>>
> >>> However, this is not good as it remove under-powering range too far. I
> >> was getting only about 7% less performance but 90W(!) less consumption
> >> when set to my 115W before. Also I wonder if we as a OS of options and
> >> freedom have to stick to such very high reference for min values without
> >> ability to override them through some sys ctrls. Commit was done by amd
> >> guy and I wonder if because of maybe this post that I made few months
> >> ago(business strategy?):
> >>>
> >>>
> >> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
> >>>
> >>> This is not a dangerous OC upwards where I can understand desire to
> >> protect HW, it is downward, having min cap at 190W when card pull on
> >> 115W almost same speed is IMO crazy to deny. We don't talk about default
> >> or reference values here either, just a move to lower the range of
> >> options for whatever reason.
> >>>
> >>> I don't know how much power you guys have over them, but please
> >> consider either reverting this change, or give us an option to set
> >> min_cap through say /sys (right now param is readonly, even for root).
> >>>
> >>>
> >>> Thank you in advance for looking into this, with regards:  Romano
> >> """
> >>
> >> And while at it, let me add this issue to the tracking as well
> >>
> >> [TLDR: I'm adding this report to the list of tracked Linux kernel
> >> regressions; the text you find below is based on a few templates
> >> paragraphs you might have encountered already in similar form.
> >> See link in footer if these mails annoy you.]
> >>
> >> Thanks for the report. To be sure the issue doesn't fall through the
> >> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> >> tracking bot:
> >>
> >> #regzbot introduced 1958946858a62b /
> >> #regzbot title drm: amdgpu: under-powering broke
> >>
> >> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >> --
> >> Everything you wanna know about Linux kernel regression tracking:
> >> https://linux-regtracking.leemhuis.info/about/#tldr
> >> That page also explains what to do if mails like this annoy you.
> >
> >


* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 15:15         ` Alex Deucher
@ 2024-02-20 15:26           ` Christian König
  2024-02-20 15:27           ` Hans de Goede
  1 sibling, 0 replies; 28+ messages in thread
From: Christian König @ 2024-02-20 15:26 UTC (permalink / raw)
  To: Alex Deucher, Linux regressions mailing list
  Cc: Alex Deucher, Pan, Xinhui, Ma Jun, amd-gfx, Dave Airlie,
	Daniel Vetter, Greg KH, Roman Benes

Am 20.02.24 um 16:15 schrieb Alex Deucher:
> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
>> On 20.02.24 15:45, Alex Deucher wrote:
>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>> On 17.02.24 14:30, Greg KH wrote:
>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
>>>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
>>>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
>>>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
>>>>>> even for root. This is not of above apps issue but of the kernel, I read
>>>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
>>>>>> kernel and my 115W(under power)cap work again as before.
>>>> For the record and everyone that lands here: the cause is known now
>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
>>>> value") [v6.7-rc1]) and the issue afaics tracked here:
>>>>
>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>
>>>> Other mentions:
>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
>>>>
>>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
>>>> this there (but might have missed something!). From what I can see I
>>>> assume this will likely be somewhat tricky to handle, as a revert
>>>> overall might be a bad idea here. We'll see I guess.
>>> The change aligns the driver what has been validated on each board
>>> design.  Windows uses the same limits.  Using values lower than the
>>> validated range can lead to undefined behavior and could potentially
>>> damage your hardware.
>> Thx for the reply! Yeah, I was expecting something along those lines.
>>
>> Nevertheless it afaics still is a regression in the eyes of many users.
>> I'm not sure how Linus feels about this, but I wonder if we can find
>> some solution here so that users that really want to, can continue to do
>> what was possible out-of-the box before. Is that possible to realize or
>> even supported already?
>>
>> And sure, those users would be running their hardware outside of its
>> specifications. But is that different from overclocking (which the
>> driver allows, doesn't it? If not by all means please correct me!)?
> Sure.  The driver has always had upper bound limits for overclocking,
> this change adds lower bounds checking for underclocking as well.
> When the silicon validation teams set the bounding box for a device,
> they set a range of values where it's reasonable to operate based on
> the characteristics of the design.
>
> If we did want to allow extended underclocking, we need a big warning
> in the logs at the very least.

Yeah, I mean we had a similar outcry when we started to apply the limits 
for the display PLLs as well.

It's just that we have to stay inside certain parameters to be allowed, 
as a hardware vendor, to sell the stuff in most countries because of 
public regulations.

I mean you can in theory program the ASIC so that it starts sucking more 
power than allowed through the PCIe lanes which could start a fire. 
Because of that certain settings are protected by signed firmware images.

Undervolting is not as problematic as overclocking or overvolting, 
but you can still do stuff which is outside the hardware specification 
with that.

Regards,
Christian.

>
> Alex
>
>> Ciao, Thorsten
>>
>>>> Roman posted something that apparently was meant to go to the list, so
>>>> let me put it here:
>>>>
>>>> """
>>>> UPDATE: User fililip already posted patch, but it need to be merged,
>>>> discussion is on gitlab link below.
>>>>
>>>> (PS: I hope I am replying correctly to "all" now? - using original addr.)
>>>>
>>>>
>>>>> it seems that commit was already found(see user's 'fililip' comment):
>>>>>
>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39
>>>>> Author: Ma Jun <Jun.Ma2@amd.com>
>>>>> Date:   Thu Oct 12 09:33:45 2023 +0800
>>>>>
>>>>>      drm/amd/pm: Support for getting power1_cap_min value
>>>>>
>>>>>      Support for getting power1_cap_min value on smu13 and smu11.
>>>>>      For other Asics, we still use 0 as the default value.
>>>>>
>>>>>      Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
>>>>>      Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
>>>>>      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>>>>
>>>>> However, this is not good as it remove under-powering range too far. I
>>>> was getting only about 7% less performance but 90W(!) less consumption
>>>> when set to my 115W before. Also I wonder if we as a OS of options and
>>>> freedom have to stick to such very high reference for min values without
>>>> ability to override them through some sys ctrls. Commit was done by amd
>>>> guy and I wonder if because of maybe this post that I made few months
>>>> ago(business strategy?):
>>>>>
>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
>>>>> This is not a dangerous OC upwards where I can understand desire to
>>>> protect HW, it is downward, having min cap at 190W when card pull on
>>>> 115W almost same speed is IMO crazy to deny. We don't talk about default
>>>> or reference values here either, just a move to lower the range of
>>>> options for whatever reason.
>>>>> I don't know how much power you guys have over them, but please
>>>> consider either reverting this change, or give us an option to set
>>>> min_cap through say /sys (right now param is readonly, even for root).
>>>>>
>>>>> Thank you in advance for looking into this, with regards:  Romano
>>>> """
>>>>
>>>> And while at it, let me add this issue to the tracking as well
>>>>
>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>>> regressions; the text you find below is based on a few templates
>>>> paragraphs you might have encountered already in similar form.
>>>> See link in footer if these mails annoy you.]
>>>>
>>>> Thanks for the report. To be sure the issue doesn't fall through the
>>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>>>> tracking bot:
>>>>
>>>> #regzbot introduced 1958946858a62b /
>>>> #regzbot title drm: amdgpu: under-powering broke
>>>>
>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>> --
>>>> Everything you wanna know about Linux kernel regression tracking:
>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>> That page also explains what to do if mails like this annoy you.
>>>



* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 15:15         ` Alex Deucher
  2024-02-20 15:26           ` Christian König
@ 2024-02-20 15:27           ` Hans de Goede
  2024-02-20 15:42             ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-20 18:14             ` Alex Deucher
  1 sibling, 2 replies; 28+ messages in thread
From: Hans de Goede @ 2024-02-20 15:27 UTC (permalink / raw)
  To: Alex Deucher, Linux regressions mailing list
  Cc: Alex Deucher, Christian König, Pan, Xinhui, Ma Jun, amd-gfx,
	Dave Airlie, Daniel Vetter, Greg KH, Roman Benes

Hi,

On 2/20/24 16:15, Alex Deucher wrote:
> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
>>
>> On 20.02.24 15:45, Alex Deucher wrote:
>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>
>>>> On 17.02.24 14:30, Greg KH wrote:
>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
>>>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
>>>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
>>>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
>>>>>> even for root. This is not of above apps issue but of the kernel, I read
>>>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
>>>>>> kernel and my 115W(under power)cap work again as before.
>>>>>
>>>> For the record and everyone that lands here: the cause is known now
>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
>>>> value") [v6.7-rc1]) and the issue afaics tracked here:
>>>>
>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>
>>>> Other mentions:
>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
>>>>
>>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
>>>> this there (but might have missed something!). From what I can see I
>>>> assume this will likely be somewhat tricky to handle, as a revert
>>>> overall might be a bad idea here. We'll see I guess.
>>>
>>> The change aligns the driver what has been validated on each board
>>> design.  Windows uses the same limits.  Using values lower than the
>>> validated range can lead to undefined behavior and could potentially
>>> damage your hardware.
>>
>> Thx for the reply! Yeah, I was expecting something along those lines.
>>
>> Nevertheless it afaics still is a regression in the eyes of many users.
>> I'm not sure how Linus feels about this, but I wonder if we can find
>> some solution here so that users that really want to, can continue to do
>> what was possible out-of-the box before. Is that possible to realize or
>> even supported already?
>>
>> And sure, those users would be running their hardware outside of its
>> specifications. But is that different from overclocking (which the
>> driver allows, doesn't it? If not by all means please correct me!)?
> 
> Sure.  The driver has always had upper bound limits for overclocking,
> this change adds lower bounds checking for underclocking as well.
> When the silicon validation teams set the bounding box for a device,
> they set a range of values where it's reasonable to operate based on
> the characteristics of the design.
> 
> If we did want to allow extended underclocking, we need a big warning
> in the logs at the very least.

Requiring a module-option to be set to allow this, as well as a big
warning in the logs sounds like a good solution to me.
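
As a rough model of that proposal (plain Python, not kernel code), the
opt-in could behave like the sketch below. The flag name
allow_below_validated is hypothetical; no such amdgpu option is
confirmed in this thread. A floor of 0 reflects the default that the
commit message says is still used for other ASICs.

```python
import logging

log = logging.getLogger("power_cap_sketch")


def effective_floor(validated_min_uw, allow_below_validated):
    """Pick the lower bound to enforce, in microwatts.

    Without the (hypothetical) opt-in, the validated minimum applies.
    With it, the floor drops to 0 and a loud warning is logged.
    """
    if allow_below_validated:
        log.warning(
            "Power cap floor disabled by user option; running below the "
            "validated range is outside the hardware specification."
        )
        return 0
    return validated_min_uw


def set_power_cap(requested_uw, validated_min_uw, max_uw,
                  allow_below_validated=False):
    """Return the accepted cap, or raise ValueError if it is out of range."""
    floor = effective_floor(validated_min_uw, allow_below_validated)
    if not floor <= requested_uw <= max_uw:
        raise ValueError(
            f"requested {requested_uw} uW outside [{floor}, {max_uw}] uW"
        )
    return requested_uw
```

The default stays at the validated range, so only users who
explicitly opt in get the old behaviour, and the warning leaves a
trace in the logs for later bug reports.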

Regards,

Hans





>>>> Roman posted something that apparently was meant to go to the list, so
>>>> let me put it here:
>>>>
>>>> """
>>>> UPDATE: User fililip already posted patch, but it need to be merged,
>>>> discussion is on gitlab link below.
>>>>
>>>> (PS: I hope I am replying correctly to "all" now? - using original addr.)
>>>>
>>>>
>>>>> it seems that commit was already found(see user's 'fililip' comment):
>>>>>
>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39
>>>>> Author: Ma Jun <Jun.Ma2@amd.com>
>>>>> Date:   Thu Oct 12 09:33:45 2023 +0800
>>>>>
>>>>>     drm/amd/pm: Support for getting power1_cap_min value
>>>>>
>>>>>     Support for getting power1_cap_min value on smu13 and smu11.
>>>>>     For other Asics, we still use 0 as the default value.
>>>>>
>>>>>     Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
>>>>>     Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
>>>>>     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>>>>
>>>>> However, this is not good as it remove under-powering range too far. I
>>>> was getting only about 7% less performance but 90W(!) less consumption
>>>> when set to my 115W before. Also I wonder if we as a OS of options and
>>>> freedom have to stick to such very high reference for min values without
>>>> ability to override them through some sys ctrls. Commit was done by amd
>>>> guy and I wonder if because of maybe this post that I made few months
>>>> ago(business strategy?):
>>>>>
>>>>>
>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
>>>>>
>>>>> This is not a dangerous OC upwards where I can understand desire to
>>>> protect HW, it is downward, having min cap at 190W when card pull on
>>>> 115W almost same speed is IMO crazy to deny. We don't talk about default
>>>> or reference values here either, just a move to lower the range of
>>>> options for whatever reason.
>>>>>
>>>>> I don't know how much power you guys have over them, but please
>>>> consider either reverting this change, or give us an option to set
>>>> min_cap through say /sys (right now param is readonly, even for root).
>>>>>
>>>>>
>>>>> Thank you in advance for looking into this, with regards:  Romano
>>>> """
>>>>
>>>> And while at it, let me add this issue to the tracking as well
>>>>
>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>>> regressions; the text you find below is based on a few templates
>>>> paragraphs you might have encountered already in similar form.
>>>> See link in footer if these mails annoy you.]
>>>>
>>>> Thanks for the report. To be sure the issue doesn't fall through the
>>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>>>> tracking bot:
>>>>
>>>> #regzbot introduced 1958946858a62b /
>>>> #regzbot title drm: amdgpu: under-powering broke
>>>>
>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>> --
>>>> Everything you wanna know about Linux kernel regression tracking:
>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>> That page also explains what to do if mails like this annoy you.
>>>
>>>
> 



* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 15:27           ` Hans de Goede
@ 2024-02-20 15:42             ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-20 15:46               ` Alex Deucher
  2024-02-20 18:14             ` Alex Deucher
  1 sibling, 1 reply; 28+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-20 15:42 UTC (permalink / raw)
  To: Hans de Goede, Alex Deucher, Linux regressions mailing list
  Cc: Alex Deucher, Christian König, Pan, Xinhui, Ma Jun, amd-gfx,
	Dave Airlie, Daniel Vetter, Greg KH, Roman Benes



On 20.02.24 16:27, Hans de Goede wrote:
> Hi,
> 
> On 2/20/24 16:15, Alex Deucher wrote:
>> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>
>>> On 20.02.24 15:45, Alex Deucher wrote:
>>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>
>>>>> On 17.02.24 14:30, Greg KH wrote:
>>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
>>>>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
>>>>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
>>>>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
>>>>>>> even for root. This is not of above apps issue but of the kernel, I read
>>>>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
>>>>>>> kernel and my 115W(under power)cap work again as before.
>>>>>>
>>>>> For the record and everyone that lands here: the cause is known now
>>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
>>>>> value") [v6.7-rc1]) and the issue afaics tracked here:
>>>>>
>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>
>>>>> Other mentions:
>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
>>>>>
>>>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
>>>>> this there (but might have missed something!). From what I can see I
>>>>> assume this will likely be somewhat tricky to handle, as a revert
>>>>> overall might be a bad idea here. We'll see I guess.
>>>>
>>>> The change aligns the driver what has been validated on each board
>>>> design.  Windows uses the same limits.  Using values lower than the
>>>> validated range can lead to undefined behavior and could potentially
>>>> damage your hardware.
>>>
>>> Thx for the reply! Yeah, I was expecting something along those lines.
>>>
>>> Nevertheless it afaics still is a regression in the eyes of many users.
>>> I'm not sure how Linus feels about this, but I wonder if we can find
>>> some solution here so that users that really want to, can continue to do
>>> what was possible out-of-the box before. Is that possible to realize or
>>> even supported already?
>>>
>>> And sure, those users would be running their hardware outside of its
>>> specifications. But is that different from overclocking (which the
>>> driver allows, doesn't it? If not by all means please correct me!)?
>>
>> Sure.  The driver has always had upper bound limits for overclocking,
>> this change adds lower bounds checking for underclocking as well.
>> When the silicon validation teams set the bounding box for a device,
>> they set a range of values where it's reasonable to operate based on
>> the characteristics of the design.
>>
>> If we did want to allow extended underclocking, we need a big warning
>> in the logs at the very least.
> 
> Requiring a module-option to be set to allow this, as well as a big
> warning in the logs sounds like a good solution to me.

Yeah, especially as it sounds from some of the reports as if some
vendors did a really bad job when it came to setting proper lower-bound
limits, which are now adhered to -- and thus higher than what we used
out-of-the-box before 1958946858a62b was applied.

Side note: I assume that "lower bounds checking" is done in roughly
the same way by the Windows driver? Does that one allow users to go
lower somehow? Say after modifying the registry or something like that?
Or through external tools?

Ciao, Thorsten

>>>>> Roman posted something that apparently was meant to go to the list, so
>>>>> let me put it here:
>>>>>
>>>>> """
>>>>> UPDATE: User fililip already posted patch, but it need to be merged,
>>>>> discussion is on gitlab link below.
>>>>>
>>>>> (PS: I hope I am replying correctly to "all" now? - using original addr.)
>>>>>
>>>>>
>>>>>> it seems that commit was already found(see user's 'fililip' comment):
>>>>>>
>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39
>>>>>> Author: Ma Jun <Jun.Ma2@amd.com>
>>>>>> Date:   Thu Oct 12 09:33:45 2023 +0800
>>>>>>
>>>>>>     drm/amd/pm: Support for getting power1_cap_min value
>>>>>>
>>>>>>     Support for getting power1_cap_min value on smu13 and smu11.
>>>>>>     For other Asics, we still use 0 as the default value.
>>>>>>
>>>>>>     Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
>>>>>>     Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
>>>>>>     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>>>>>
>>>>>> However, this is not good as it remove under-powering range too far. I
>>>>> was getting only about 7% less performance but 90W(!) less consumption
>>>>> when set to my 115W before. Also I wonder if we as a OS of options and
>>>>> freedom have to stick to such very high reference for min values without
>>>>> ability to override them through some sys ctrls. Commit was done by amd
>>>>> guy and I wonder if because of maybe this post that I made few months
>>>>> ago(business strategy?):
>>>>>>
>>>>>>
>>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
>>>>>>
>>>>>> This is not a dangerous OC upwards where I can understand desire to
>>>>> protect HW, it is downward, having min cap at 190W when card pull on
>>>>> 115W almost same speed is IMO crazy to deny. We don't talk about default
>>>>> or reference values here either, just a move to lower the range of
>>>>> options for whatever reason.
>>>>>>
>>>>>> I don't know how much power you guys have over them, but please
>>>>> consider either reverting this change, or give us an option to set
>>>>> min_cap through say /sys (right now param is readonly, even for root).
>>>>>>
>>>>>>
>>>>>> Thank you in advance for looking into this, with regards:  Romano
>>>>> """
>>>>>
>>>>> And while at it, let me add this issue to the tracking as well
>>>>>
>>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>>>> regressions; the text you find below is based on a few templates
>>>>> paragraphs you might have encountered already in similar form.
>>>>> See link in footer if these mails annoy you.]
>>>>>
>>>>> Thanks for the report. To be sure the issue doesn't fall through the
>>>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>>>>> tracking bot:
>>>>>
>>>>> #regzbot introduced 1958946858a62b /
>>>>> #regzbot title drm: amdgpu: under-powering broke
>>>>>
>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>> --
>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>> That page also explains what to do if mails like this annoy you.
>>>>
>>>>
>>
> 
> 
> 


* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 15:42             ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-20 15:46               ` Alex Deucher
  2024-02-20 16:46                 ` Romano
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Deucher @ 2024-02-20 15:46 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: Hans de Goede, Alex Deucher, Christian König, Pan, Xinhui,
	Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter, Greg KH,
	Roman Benes

On Tue, Feb 20, 2024 at 10:42 AM Linux regression tracking (Thorsten
Leemhuis) <regressions@leemhuis.info> wrote:
>
>
>
> On 20.02.24 16:27, Hans de Goede wrote:
> > Hi,
> >
> > On 2/20/24 16:15, Alex Deucher wrote:
> >> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
> >> Leemhuis) <regressions@leemhuis.info> wrote:
> >>>
> >>> On 20.02.24 15:45, Alex Deucher wrote:
> >>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
> >>>> Leemhuis) <regressions@leemhuis.info> wrote:
> >>>>>
> >>>>> On 17.02.24 14:30, Greg KH wrote:
> >>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
> >>>>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
> >>>>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
> >>>>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
> >>>>>>> even for root. This is not of above apps issue but of the kernel, I read
> >>>>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
> >>>>>>> kernel and my 115W(under power)cap work again as before.
> >>>>>>
> >>>>> For the record and everyone that lands here: the cause is known now
> >>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
> >>>>> value") [v6.7-rc1]) and the issue afaics tracked here:
> >>>>>
> >>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
> >>>>>
> >>>>> Other mentions:
> >>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
> >>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
> >>>>>
> >>>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
> >>>>> this there (but might have missed something!). From what I can see I
> >>>>> assume this will likely be somewhat tricky to handle, as a revert
> >>>>> overall might be a bad idea here. We'll see I guess.
> >>>>
> >>>> The change aligns the driver what has been validated on each board
> >>>> design.  Windows uses the same limits.  Using values lower than the
> >>>> validated range can lead to undefined behavior and could potentially
> >>>> damage your hardware.
> >>>
> >>> Thx for the reply! Yeah, I was expecting something along those lines.
> >>>
> >>> Nevertheless it afaics still is a regression in the eyes of many users.
> >>> I'm not sure how Linus feels about this, but I wonder if we can find
> >>> some solution here so that users that really want to, can continue to do
> >>> what was possible out-of-the box before. Is that possible to realize or
> >>> even supported already?
> >>>
> >>> And sure, those users would be running their hardware outside of its
> >>> specifications. But is that different from overclocking (which the
> >>> driver allows, doesn't it? If not by all means please correct me!)?
> >>
> >> Sure.  The driver has always had upper bound limits for overclocking,
> >> this change adds lower bounds checking for underclocking as well.
> >> When the silicon validation teams set the bounding box for a device,
> >> they set a range of values where it's reasonable to operate based on
> >> the characteristics of the design.
> >>
> >> If we did want to allow extended underclocking, we need a big warning
> >> in the logs at the very least.
> >
> > Requiring a module-option to be set to allow this, as well as a big
> > warning in the logs sounds like a good solution to me.
>
> Yeah, especially as it sounds from some of the reports as if some
> vendors did a really bad job when it came to setting the proper
> lower-bound limits are now adhered -- and thus higher then what we used
> out-of-the box before 1958946858a62b was applied.
>
> Side note: I assume those "lower bounds checking" is done round about
> the same way by the Windows driver? Does that one allow users to go
> lower somehow? Say after modifying the registry or something like that?
> Or through external tools?

Windows uses the same limit.  I'm not aware of any way to override the
limit on Windows offhand.

Alex



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 15:46               ` Alex Deucher
@ 2024-02-20 16:46                 ` Romano
  2024-02-20 18:09                   ` Alex Deucher
  0 siblings, 1 reply; 28+ messages in thread
From: Romano @ 2024-02-20 16:46 UTC (permalink / raw)
  To: Alex Deucher, Linux regressions mailing list
  Cc: Hans de Goede, Alex Deucher, Christian König, Pan, Xinhui,
	Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter, Greg KH

On Windows, MSI Afterburner is the tool to try and the one most people
go for. Having used it myself in the past, I would be surprised if it
adhered to such a high minimum power cap. But even if it did, why would
we have to?

Relying on the vendor's cap has already proven wrong in this case:
things worked for quite some time, and people reported saving a
significant number of watts, in my case 90W(!) for <10% performance.

Therefore this talk about safety seems rather strange to me, especially
when we are talking about min_cap. Name me a single case where someone
fried their card due to "too low power" set in said variable. There was
one report where, by going way too low, the driver went the opposite
way into max power. That's it. That can be easily detected (fans going
crazy etc.) and reverted. It is max_cap that protects the HW (also in
the above scenario), not min_cap. Feel free to adhere to safety
standards with that one.

As for a solution, what some have suggested already exists: the patch
posted by fililip on gitlab is probably the one most of you would agree
with. It introduces a variable that can be set at boot to override
min_cap. But he did not submit it as a pull request, so please, could
one of you with access to the code and the ability to merge into the
kernel be kind enough to apply it?
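
The range the driver currently enforces can be checked from userspace
before trying to set a cap. The following is a minimal sketch, not part
of any posted patch: it assumes the amdgpu hwmon attributes
power1_cap_min and power1_cap_max (values in microwatts) are present in
the given hwmon directory. The helper names and the example wattages
are illustrative only.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def read_microwatts(hwmon_dir, attr):
    """Read one integer hwmon attribute (reported in microwatts)."""
    return int((Path(hwmon_dir) / attr).read_text().strip())


def cap_range_microwatts(hwmon_dir):
    """Return (min, max) of the power cap range reported by the driver."""
    lowest = read_microwatts(hwmon_dir, "power1_cap_min")
    highest = read_microwatts(hwmon_dir, "power1_cap_max")
    return lowest, highest


def cap_is_settable(hwmon_dir, watts):
    """True if a cap of `watts` lies inside the reported range."""
    lowest, highest = cap_range_microwatts(hwmon_dir)
    return lowest <= watts * MICROWATTS_PER_WATT <= highest
```

On a real system the directory is typically found under the card's
device node (for example /sys/class/drm/card0/device/hwmon/hwmonN,
where the card and hwmon indices vary per machine). With a reported
minimum of 190W, cap_is_settable(path, 115) returns False, which
matches the behaviour described in this thread.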




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 16:46                 ` Romano
@ 2024-02-20 18:09                   ` Alex Deucher
  2024-02-20 19:41                     ` Romano
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Deucher @ 2024-02-20 18:09 UTC (permalink / raw)
  To: Romano
  Cc: Linux regressions mailing list, Hans de Goede, Alex Deucher,
	Christian König, Pan, Xinhui, Ma Jun, amd-gfx, Dave Airlie,
	Daniel Vetter, Greg KH

On Tue, Feb 20, 2024 at 11:46 AM Romano <romaniox@gmail.com> wrote:
>
> On Windows, MSI Afterburner is the tool to try and the one most people
> go for. Having used it myself in the past, I would be surprised if it
> adhered to such a high minimum power cap. But even if it did, why would
> we have to?
>
> Relying on the vendor's cap has already proven wrong in this case:
> things worked for quite some time, and people reported saving a
> significant number of watts, in my case 90W(!) for <10% performance.
>
> Therefore this talk about safety seems rather strange to me, especially
> when we are talking about min_cap. Name me a single case where someone
> fried their card due to "too low power" set in said variable. There was
> one report where, by going way too low, the driver went the opposite
> way into max power. That's it. That can be easily detected (fans going
> crazy etc.) and reverted. It is max_cap that protects the HW (also in
> the above scenario), not min_cap. Feel free to adhere to safety
> standards with that one.

Because operation outside of the design bounding box is undefined.  It
might work for some boards but not others.  It's possible some of the
logic in the firmware or some of the components used on the board may
not work correctly below a certain limit, or the voltage regulators
used on a specific board have a minimum requirement that would not be
an issue if you stick to the bounding box.

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 15:27           ` Hans de Goede
  2024-02-20 15:42             ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-20 18:14             ` Alex Deucher
  1 sibling, 0 replies; 28+ messages in thread
From: Alex Deucher @ 2024-02-20 18:14 UTC (permalink / raw)
  To: Hans de Goede
  Cc: Linux regressions mailing list, Alex Deucher,
	Christian König, Pan, Xinhui, Ma Jun, amd-gfx, Dave Airlie,
	Daniel Vetter, Greg KH, Roman Benes

On Tue, Feb 20, 2024 at 10:27 AM Hans de Goede <hdegoede@redhat.com> wrote:
>
> Hi,
>
> On 2/20/24 16:15, Alex Deucher wrote:
> > On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
> > Leemhuis) <regressions@leemhuis.info> wrote:
> >>
> >> On 20.02.24 15:45, Alex Deucher wrote:
> >>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
> >>> Leemhuis) <regressions@leemhuis.info> wrote:
> >>>>
> >>>> On 17.02.24 14:30, Greg KH wrote:
> >>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
> >>>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
> >>>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
> >>>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
> >>>>>> even for root. This is not of above apps issue but of the kernel, I read
> >>>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
> >>>>>> kernel and my 115W(under power)cap work again as before.
> >>>>>
> >>>> For the record and everyone that lands here: the cause is known now
> >>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
> >>>> value") [v6.7-rc1]) and the issue afaics tracked here:
> >>>>
> >>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
> >>>>
> >>>> Other mentions:
> >>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
> >>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
> >>>>
> >>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
> >>>> this there (but might have missed something!). From what I can see I
> >>>> assume this will likely be somewhat tricky to handle, as a revert
> >>>> overall might be a bad idea here. We'll see I guess.
> >>>
> >>> The change aligns the driver what has been validated on each board
> >>> design.  Windows uses the same limits.  Using values lower than the
> >>> validated range can lead to undefined behavior and could potentially
> >>> damage your hardware.
> >>
> >> Thx for the reply! Yeah, I was expecting something along those lines.
> >>
> >> Nevertheless it afaics still is a regression in the eyes of many users.
> >> I'm not sure how Linus feels about this, but I wonder if we can find
> >> some solution here so that users that really want to, can continue to do
> >> what was possible out-of-the box before. Is that possible to realize or
> >> even supported already?
> >>
> >> And sure, those users would be running their hardware outside of its
> >> specifications. But is that different from overclocking (which the
> >> driver allows, doesn't it? If not by all means please correct me!)?
> >
> > Sure.  The driver has always had upper bound limits for overclocking,
> > this change adds lower bounds checking for underclocking as well.
> > When the silicon validation teams set the bounding box for a device,
> > they set a range of values where it's reasonable to operate based on
> > the characteristics of the design.
> >
> > If we did want to allow extended underclocking, we need a big warning
> > in the logs at the very least.
>
> Requiring a module-option to be set to allow this, as well as a big
> warning in the logs sounds like a good solution to me.

I dunno.  I kind of go back and forth with it.  It's yet another knob
to maintain and when we've done things like this in the past, we get
lots of bug reports or angry users because the kernel is sending
warnings when they set it.

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 18:09                   ` Alex Deucher
@ 2024-02-20 19:41                     ` Romano
  2024-02-20 20:18                       ` Alex Deucher
  0 siblings, 1 reply; 28+ messages in thread
From: Romano @ 2024-02-20 19:41 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Linux regressions mailing list, Hans de Goede, Alex Deucher,
	Christian König, Pan, Xinhui, Ma Jun, amd-gfx, Dave Airlie,
	Daniel Vetter, Greg KH

If the extended low range is enabled via a boot option, as in the
proposed patch, the user has clearly made an intentional decision. The
behavior may be undefined, but it surely won't fry the hardware.
Overclocking is undefined in that respect too: you can, for example, go
out of range with the voltage-to-frequency ratio (while still within
the vendor's limits) and crash the system.



On 2/20/24 19:09, Alex Deucher wrote:
> On Tue, Feb 20, 2024 at 11:46 AM Romano <romaniox@gmail.com> wrote:
>> On Windows, MSI Afterburner is the tool to try and the one most people
>> go for. Having used it myself in the past, I would be surprised if it
>> adhered to such a high minimum power cap. But even if it did, why would
>> we have to?
>>
>> Relying on the vendor's cap has already proven wrong in this case:
>> things worked for quite some time, and people reported saving a
>> significant number of watts, in my case 90W(!) for <10% performance.
>>
>> Therefore this talk about safety seems rather strange to me, especially
>> when we are talking about min_cap. Name me a single case where someone
>> fried their card due to "too low power" set in said variable. There was
>> one report where, by going way too low, the driver went the opposite
>> way into max power. That's it. That can be easily detected (fans going
>> crazy etc.) and reverted. It is max_cap that protects the HW (also in
>> the above scenario), not min_cap. Feel free to adhere to safety
>> standards with that one.
> Because operation outside of the design bounding box is undefined.  It
> might work for some boards but not others.  It's possible some of the
> logic in the firmware or some of the components used on the board may
> not work correctly below a certain limit, or the voltage regulators
> used on a specific board have a minimum requirement that would not be
> an issue if you stick the bounding box.
>
> Alex
>
>> As for solution, what some suggested already exist - a patch posted by
>> fililip on gitlab is probably the way most of you would agree. It
>> introduce a variable that can be set during boot to override min_cap.
>> But he did not pull requested it, so please, if any one of you who have
>> access to code and merge kernel would be kind enough to implement it.
>>
>>
>>
>> On 2/20/24 16:46, Alex Deucher wrote:
>>> On Tue, Feb 20, 2024 at 10:42 AM Linux regression tracking (Thorsten
>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>
>>>> On 20.02.24 16:27, Hans de Goede wrote:
>>>>> Hi,
>>>>>
>>>>> On 2/20/24 16:15, Alex Deucher wrote:
>>>>>> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>> On 20.02.24 15:45, Alex Deucher wrote:
>>>>>>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
>>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>>>> On 17.02.24 14:30, Greg KH wrote:
>>>>>>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
>>>>>>>>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
>>>>>>>>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
>>>>>>>>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
>>>>>>>>>>> even for root. This is not of above apps issue but of the kernel, I read
>>>>>>>>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
>>>>>>>>>>> kernel and my 115W(under power)cap work again as before.
>>>>>>>>> For the record and everyone that lands here: the cause is known now
>>>>>>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
>>>>>>>>> value") [v6.7-rc1]) and the issue afaics tracked here:
>>>>>>>>>
>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>>>>>
>>>>>>>>> Other mentions:
>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
>>>>>>>>>
>>>>>>>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
>>>>>>>>> this there (but might have missed something!). From what I can see I
>>>>>>>>> assume this will likely be somewhat tricky to handle, as a revert
>>>>>>>>> overall might be a bad idea here. We'll see I guess.
>>>>>>>> The change aligns the driver what has been validated on each board
>>>>>>>> design.  Windows uses the same limits.  Using values lower than the
>>>>>>>> validated range can lead to undefined behavior and could potentially
>>>>>>>> damage your hardware.
>>>>>>> Thx for the reply! Yeah, I was expecting something along those lines.
>>>>>>>
>>>>>>> Nevertheless it afaics still is a regression in the eyes of many users.
>>>>>>> I'm not sure how Linus feels about this, but I wonder if we can find
>>>>>>> some solution here so that users that really want to, can continue to do
>>>>>>> what was possible out-of-the box before. Is that possible to realize or
>>>>>>> even supported already?
>>>>>>>
>>>>>>> And sure, those users would be running their hardware outside of its
>>>>>>> specifications. But is that different from overclocking (which the
>>>>>>> driver allows, doesn't it? If not by all means please correct me!)?
>>>>>> Sure.  The driver has always had upper bound limits for overclocking,
>>>>>> this change adds lower bounds checking for underclocking as well.
>>>>>> When the silicon validation teams set the bounding box for a device,
>>>>>> they set a range of values where it's reasonable to operate based on
>>>>>> the characteristics of the design.
>>>>>>
>>>>>> If we did want to allow extended underclocking, we need a big warning
>>>>>> in the logs at the very least.
>>>>> Requiring a module-option to be set to allow this, as well as a big
>>>>> warning in the logs sounds like a good solution to me.
>>>> Yeah, especially as it sounds from some of the reports as if some
>>>> vendors did a really bad job when it came to setting the proper
>>>> lower-bound limits are now adhered -- and thus higher then what we used
>>>> out-of-the box before 1958946858a62b was applied.
>>>>
>>>> Side note: I assume those "lower bounds checking" is done round about
>>>> the same way by the Windows driver? Does that one allow users to go
>>>> lower somehow? Say after modifying the registry or something like that?
>>>> Or through external tools?
>>> Windows uses the same limit.  I'm not aware of any way to override the
>>> limit on windows off hand.
>>>
>>> Alex
>>>
>>>
>>>> Ciao, Thorsten
>>>>
>>>>>>>>> Roman posted something that apparently was meant to go to the list, so
>>>>>>>>> let me put it here:
>>>>>>>>>
>>>>>>>>> """
>>>>>>>>> UPDATE: User fililip already posted patch, but it need to be merged,
>>>>>>>>> discussion is on gitlab link below.
>>>>>>>>>
>>>>>>>>> (PS: I hope I am replying correctly to "all" now? - using original addr.)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> it seems that commit was already found(see user's 'fililip' comment):
>>>>>>>>>>
>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39
>>>>>>>>>> Author: Ma Jun <Jun.Ma2@amd.com>
>>>>>>>>>> Date:   Thu Oct 12 09:33:45 2023 +0800
>>>>>>>>>>
>>>>>>>>>>       drm/amd/pm: Support for getting power1_cap_min value
>>>>>>>>>>
>>>>>>>>>>       Support for getting power1_cap_min value on smu13 and smu11.
>>>>>>>>>>       For other Asics, we still use 0 as the default value.
>>>>>>>>>>
>>>>>>>>>>       Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
>>>>>>>>>>       Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
>>>>>>>>>>       Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>>>>>>>>>
>>>>>>>>>> However, this is not good as it remove under-powering range too far. I
>>>>>>>>> was getting only about 7% less performance but 90W(!) less consumption
>>>>>>>>> when set to my 115W before. Also I wonder if we as a OS of options and
>>>>>>>>> freedom have to stick to such very high reference for min values without
>>>>>>>>> ability to override them through some sys ctrls. Commit was done by amd
>>>>>>>>> guy and I wonder if because of maybe this post that I made few months
>>>>>>>>> ago(business strategy?):
>>>>>>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
>>>>>>>>>> This is not a dangerous OC upwards where I can understand desire to
>>>>>>>>> protect HW, it is downward, having min cap at 190W when card pull on
>>>>>>>>> 115W almost same speed is IMO crazy to deny. We don't talk about default
>>>>>>>>> or reference values here either, just a move to lower the range of
>>>>>>>>> options for whatever reason.
>>>>>>>>>> I don't know how much power you guys have over them, but please
>>>>>>>>> consider either reverting this change, or give us an option to set
>>>>>>>>> min_cap through say /sys (right now param is readonly, even for root).
>>>>>>>>>> Thank you in advance for looking into this, with regards:  Romano
>>>>>>>>> """
>>>>>>>>>
>>>>>>>>> And while at it, let me add this issue to the tracking as well
>>>>>>>>>
>>>>>>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>>>>>>>> regressions; the text you find below is based on a few templates
>>>>>>>>> paragraphs you might have encountered already in similar form.
>>>>>>>>> See link in footer if these mails annoy you.]
>>>>>>>>>
>>>>>>>>> Thanks for the report. To be sure the issue doesn't fall through the
>>>>>>>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>>>>>>>>> tracking bot:
>>>>>>>>>
>>>>>>>>> #regzbot introduced 1958946858a62b /
>>>>>>>>> #regzbot title drm: amdgpu: under-powering broke
>>>>>>>>>
>>>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>>>>>> --
>>>>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>>>> That page also explains what to do if mails like this annoy you.
>>>>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 19:41                     ` Romano
@ 2024-02-20 20:18                       ` Alex Deucher
  2024-02-20 21:30                         ` Romano
  2024-02-21  6:06                         ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 2 replies; 28+ messages in thread
From: Alex Deucher @ 2024-02-20 20:18 UTC (permalink / raw)
  To: Romano
  Cc: Linux regressions mailing list, Hans de Goede, Alex Deucher,
	Christian König, Pan, Xinhui, Ma Jun, amd-gfx, Dave Airlie,
	Daniel Vetter, Greg KH

On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@gmail.com> wrote:
>
> If the increased low range is allowed via boot option, like in proposed
> patch, user clearly made an intentional decision. Undefined, but won't
> fry his hardware for sure. Undefined is also overclocking in that
> matter. You can go out of range with ratio of voltage vs frequency(still
> within vendor's limits) for example and crash the system.

This whole thing reminds me of this:
https://xkcd.com/1172/
The problem is another module parameter is another interface to
maintain and validate.  Moreover, we've had a number of cases in the
past where users have under or overclocked and reported bugs or
stability issues and it did not come to light that they were doing
that until we'd already spent a good deal of time trying to debug the
issue.  This obviously can still happen if you allow any sort of over
or underclocking, but at least if you stick to the limits you are
staying within the bounding box of the design.

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 20:18                       ` Alex Deucher
@ 2024-02-20 21:30                         ` Romano
  2024-02-21  6:06                         ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 0 replies; 28+ messages in thread
From: Romano @ 2024-02-20 21:30 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Linux regressions mailing list, Hans de Goede, Alex Deucher,
	Christian König, Pan, Xinhui, Ma Jun, amd-gfx, Dave Airlie,
	Daniel Vetter, Greg KH

This setting does not introduce stability problems or bugs. The
voltage/frequency ratio is dynamic relative to the power cap, and the
GPU adjusts to it automatically. This is not like lowering the voltage
alone: when the power cap is lowered, the GPU simply adjusts its
frequency and voltage on the fly and remains stable without crashes. If
you lower the cap way too far, the GPU flips to maximum power usage on
its own, as reported. So values below the vendor's minimum are not as
undefined as they seem, and safety checks are in effect outside the
vendor's range as well.
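
For anyone who wants to see which range the driver currently accepts,
here is a minimal sketch. It assumes the amdgpu hwmon files power1_cap,
power1_cap_min and power1_cap_max, which report microwatts; the function
names and the directory you pass in are illustrative only.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def read_power_cap_range(hwmon_dir):
    """Return (min_w, current_w, max_w) from an amdgpu hwmon directory.

    hwmon_dir is typically something like
    /sys/class/drm/card0/device/hwmon/hwmonN (the exact path varies
    per system and is an assumption here).
    """
    base = Path(hwmon_dir)

    def read_watts(name):
        # The hwmon power files hold an integer number of microwatts.
        return int((base / name).read_text().strip()) / MICROWATTS_PER_WATT

    return (
        read_watts("power1_cap_min"),
        read_watts("power1_cap"),
        read_watts("power1_cap_max"),
    )


def is_within_range(requested_w, cap_range):
    """True if requested_w lies inside the reported [min, max] range."""
    min_w, _current_w, max_w = cap_range
    return min_w <= requested_w <= max_w
```

Calling is_within_range(115, read_power_cap_range(path)) shows whether a
115W cap falls inside the range reported by the running kernel.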

As for maintenance, the patch is literally a single "if" switch and a
boot option.
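
To illustrate the shape of such a switch, here is a hypothetical sketch.
It is not fililip's actual patch and not driver code; it is a small
Python model in which the name ignore_min_cap and both function names
are made up. It assumes the override falls back to a lower bound of 0
and logs a warning, as was suggested earlier in the thread.

```python
import logging

log = logging.getLogger("power_cap_sketch")


def effective_min_cap_w(vendor_min_w, ignore_min_cap=False):
    """Lower bound to enforce: 0 when the (hypothetical) override is set."""
    if ignore_min_cap:
        return 0
    return vendor_min_w


def check_power_cap(requested_w, vendor_min_w, vendor_max_w,
                    ignore_min_cap=False):
    """Return requested_w if it is allowed, otherwise raise ValueError."""
    lower = effective_min_cap_w(vendor_min_w, ignore_min_cap)
    # The upper bound is always enforced; only the lower bound is relaxed.
    if requested_w < lower or requested_w > vendor_max_w:
        raise ValueError(
            f"{requested_w} W is outside [{lower}, {vendor_max_w}] W"
        )
    if requested_w < vendor_min_w:
        # Accepted only because of the override, so make it visible.
        log.warning(
            "power cap %s W is below the validated minimum of %s W",
            requested_w, vendor_min_w,
        )
    return requested_w
```

With the override off, a request below the vendor minimum is rejected;
with it on, the request is accepted and a warning is logged, while the
maximum stays enforced either way.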

The idea that you spare yourself extra trouble from reports by not
implementing this is also false. If this patch is not merged, I can say
with confidence that people will end up patching their kernels (I know I
would) because of how much power this option can save. It is way too
important. You will still end up with reports, only this time without
even being aware of the patch, because it will be unofficial and
"in-house" made, and it will probably be forgotten later on that it even
exists. You also introduce extra work for users: it will no longer be a
simple "pacman -Syu", but the hassle of a whole kernel setup, patching
and recompilation on the user's side.



On 2/20/24 21:18, Alex Deucher wrote:
> On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@gmail.com> wrote:
>> If the increased low range is allowed via boot option, like in proposed
>> patch, user clearly made an intentional decision. Undefined, but won't
>> fry his hardware for sure. Undefined is also overclocking in that
>> matter. You can go out of range with ratio of voltage vs frequency(still
>> within vendor's limits) for example and crash the system.
> This whole thing reminds me of this:
> https://xkcd.com/1172/
> The problem is another module parameter is another interface to
> maintain and validate.  Moreover, we've had a number of cases in the
> past where users have under or overclocked and reported bugs or
> stability issues and it did not come to light that they were doing
> that until we'd already spent a good deal of time trying to debug the
> issue.  This obviously can still happen if you allow any sort of over
> or underclocking, but at least if you stick to the limits you are
> staying within the bounding box of the design.
>
> Alex

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-20 20:18                       ` Alex Deucher
  2024-02-20 21:30                         ` Romano
@ 2024-02-21  6:06                         ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-21 15:15                           ` Christian König
  2024-02-21 15:39                           ` Alex Deucher
  1 sibling, 2 replies; 28+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-21  6:06 UTC (permalink / raw)
  To: Alex Deucher, Romano
  Cc: Linux regressions mailing list, Hans de Goede, Alex Deucher,
	Christian König, Pan, Xinhui, Ma Jun, amd-gfx, Dave Airlie,
	Daniel Vetter, Greg KH

On 20.02.24 21:18, Alex Deucher wrote:
> On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@gmail.com> wrote:
>>
>> If the increased low range is allowed via boot option, like in proposed
>> patch, user clearly made an intentional decision. Undefined, but won't
>> fry his hardware for sure. Undefined is also overclocking in that
>> matter. You can go out of range with ratio of voltage vs frequency(still
>> within vendor's limits) for example and crash the system.
> 
> This whole thing reminds me of this:
> https://xkcd.com/1172/
> The problem is another module parameter is another interface to
> maintain and validate.

Yup, of course, all that is understood.

But we have this "no regressions" rule for a reason. Adhering to it
strictly would afaics be counter-productive in this situation, but giving
users some way to manually do what was possible out of the box before
IMHO is the minimum we should do.

Maybe just allow that parameter only up to a certain recent GPU
generation; that way you won't have to deal with it at some point in
the future.

>  Moreover, we've had a number of cases in the
> past where users have under or overclocked and reported bugs or
> stability issues and it did not come to light that they were doing
> that until we'd already spent a good deal of time trying to debug the
> issue.

Taint the kernel when that module parameter is used? We iirc have a
taint bit exactly for this sort of situation. Sure, such reports will
still happen, but then you at least have an indicator to spot them.
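
As an illustration of how such an indicator could be spotted in a bug
report, here is a minimal userspace sketch that decodes the taint bitmask
exposed in /proc/sys/kernel/tainted. The flag table is only a subset of
the bits listed in the kernel's "Tainted kernels" documentation, the
thread does not say which bit would be used for this parameter, and the
helper name decode_taint is made up for this example.

```python
from pathlib import Path

# Subset of taint bits from the kernel's "Tainted kernels" documentation.
# Which bit a hypothetical amdgpu override would set is NOT decided in
# this thread; nothing here implies a choice.
TAINT_FLAGS = {
    0: "P: proprietary module was loaded",
    1: "F: module was force loaded",
    2: "S: kernel running on an out of specification system",
    6: "U: taint requested by userspace application",
    9: "W: kernel issued warning",
    12: "O: externally-built (out-of-tree) module was loaded",
}


def decode_taint(mask: int) -> list[str]:
    """Return descriptions of the known taint bits set in mask, lowest bit first."""
    return [desc for bit, desc in sorted(TAINT_FLAGS.items()) if mask & (1 << bit)]


if __name__ == "__main__":
    # The file holds a single decimal integer; 0 means "not tainted".
    mask = int(Path("/proc/sys/kernel/tainted").read_text().strip())
    flags = decode_taint(mask)
    print(f"taint mask {mask}:", flags if flags else "no known flags set")
```

A maintainer triaging a report would only need the mask value from the
reporter's dmesg or from this file to see whether such a flag was set.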

Ciao, Thorsten

>  This obviously can still happen if you allow any sort of over
> or underclocking, but at least if you stick to the limits you are
> staying within the bounding box of the design.
> 
> Alex

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-21  6:06                         ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-21 15:15                           ` Christian König
  2024-02-21 15:44                             ` Thorsten Leemhuis
  2024-02-21 16:47                             ` Romano
  2024-02-21 15:39                           ` Alex Deucher
  1 sibling, 2 replies; 28+ messages in thread
From: Christian König @ 2024-02-21 15:15 UTC (permalink / raw)
  To: Linux regressions mailing list, Alex Deucher, Romano
  Cc: Hans de Goede, Alex Deucher, Christian König, Pan, Xinhui,
	Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter, Greg KH

Am 21.02.24 um 07:06 schrieb Linux regression tracking (Thorsten Leemhuis):
> On 20.02.24 21:18, Alex Deucher wrote:
>> On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@gmail.com> wrote:
>>> If the increased low range is allowed via boot option, like in proposed
>>> patch, user clearly made an intentional decision. Undefined, but won't
>>> fry his hardware for sure. Undefined is also overclocking in that
>>> matter. You can go out of range with ratio of voltage vs frequency(still
>>> within vendor's limits) for example and crash the system.
>> This whole thing reminds me of this:
>> https://xkcd.com/1172/
>> The problem is another module parameter is another interface to
>> maintain and validate.
> Yup, of course, all that is understood.
>
> But we have this "no regressions" rule for a reason. Adhering to it
> strictly would afaics be counter-productive in this situation, but give
> users some way to manually do what was possible before out-of-the box
> IMHO is the minimum we should do.
>
> Maybe just allow that parameter only up to a certain recent GPU
> generation, that way you won't have to deal with that at some point in
> the future.
>
>>   Moreover, we've had a number of cases in the
>> past where users have under or overclocked and reported bugs or
>> stability issues and it did not come to light that they were doing
>> that until we'd already spent a good deal of time trying to debug the
>> issue.
> Taint the kernel when that module parameter is used? We iirc have a
> taint bit exactly for this sort of situation. Sure, such reports will
> still happen, but then you at least have an indicator to spot them.

Let me recap what happened here:

1. AMD is the GPU manufacturer, but apart from a few exceptions doesn't 
assemble boards.

2. Vendors take AMD's GPUs and assemble them together with power
regulators, memory and a bunch of other components into a PCIe board.

3. AMD provides a vendor-agnostic driver, and for this to work vendors
describe the min/max voltage their power regulators can do in some
flash memory.

4. Hardware engineers pointed out that AMD's open source drivers were not
respecting the min value.

5. In response a patch was applied to respect that value and not use 
something outside of the hardware specification the vendor provided.
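
From userspace, the effect of steps 3 to 5 can be checked with a small
sketch like the one below. It assumes the amdgpu hwmon attributes
power1_cap_min and power1_cap_max (values in microwatts) under
/sys/class/drm/card0/device/hwmon/hwmon*/; the card index and hwmon
number vary per system, and the helper names are made up for this
example. It is not the driver's own code, only a range check on the
values the driver exports.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def read_microwatts(hwmon_dir: Path, name: str) -> int:
    """Read one hwmon power attribute, which is reported in microwatts."""
    return int((hwmon_dir / name).read_text().strip())


def cap_in_range(requested_w: float, min_uw: int, max_uw: int) -> bool:
    """Return True if the requested cap (in watts) lies inside [min, max]."""
    requested_uw = round(requested_w * MICROWATTS_PER_WATT)
    return min_uw <= requested_uw <= max_uw


if __name__ == "__main__":
    # Assumed location; adjust card0 to match the GPU in question.
    base = Path("/sys/class/drm/card0/device/hwmon")
    for hwmon in sorted(base.glob("hwmon*")):
        lo = read_microwatts(hwmon, "power1_cap_min")
        hi = read_microwatts(hwmon, "power1_cap_max")
        print(f"{hwmon.name}: cap range {lo / 1e6:.0f} W to {hi / 1e6:.0f} W")
        print("115 W inside range:", cap_in_range(115, lo, hi))
```

With the numbers from the original report (minimum 190 W on 6.7+, and
effectively 0 before the change), a 115 W request falls outside the range
on the newer kernel and inside it on the older one.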

I'm not sure about it, but I think AMD needs to respect the min/max values
simply by contract and it's not really an option to not do that.

If someone really wants to run their hardware outside the vendor
recommended values, that person can still patch the driver to ignore the
limits. It's just that AMD is then not responsible for any damage
resulting from that.

So as far as I can see the request to make that a module option is a 
no-go, especially since hardware engineers have explicitly pointed out 
that we have to do this in the software stack.

Regards,
Christian.

>
> Ciao, Thorsten
>
>>   This obviously can still happen if you allow any sort of over
>> or underclocking, but at least if you stick to the limits you are
>> staying within the bounding box of the design.
>>
>> Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-21  6:06                         ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-21 15:15                           ` Christian König
@ 2024-02-21 15:39                           ` Alex Deucher
  2024-02-21 15:53                             ` Linux regression tracking (Thorsten Leemhuis)
                                               ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Alex Deucher @ 2024-02-21 15:39 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: Romano, Hans de Goede, Alex Deucher, Christian König, Pan,
	Xinhui, Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter, Greg KH

On Wed, Feb 21, 2024 at 1:06 AM Linux regression tracking (Thorsten
Leemhuis) <regressions@leemhuis.info> wrote:
>
> On 20.02.24 21:18, Alex Deucher wrote:
> > On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@gmail.com> wrote:
> >>
> >> If the increased low range is allowed via boot option, like in proposed
> >> patch, user clearly made an intentional decision. Undefined, but won't
> >> fry his hardware for sure. Undefined is also overclocking in that
> >> matter. You can go out of range with ratio of voltage vs frequency(still
> >> within vendor's limits) for example and crash the system.
> >
> > This whole thing reminds me of this:
> > https://xkcd.com/1172/
> > The problem is another module parameter is another interface to
> > maintain and validate.
>
> Yup, of course, all that is understood.
>
> But we have this "no regressions" rule for a reason. Adhering to it
> strictly would afaics be counter-productive in this situation, but give
> users some way to manually do what was possible before out-of-the box
> IMHO is the minimum we should do.
>
> Maybe just allow that parameter only up to a certain recent GPU
> generation, that way you won't have to deal with that at some point in
> the future.

The problem is the cumulative effect of all of these parameters.
Every time there is some change in the driver someone disagrees with
there is a push to add a module parameter for it.  The driver already
has too many module parameters and it's hard to keep them all working
consistently and in every possible combination.  Moreover, the module
options are supposed to be mainly for debugging.  The driver sets
proper defaults for all chips to ensure proper operation, however lots
of random forums seem to treat them like they are the recipe for some
special sauce so users are constantly setting various combinations of
them because they read somewhere on a forum that it would make their
GPU run faster.  More often than not this leads to problems.

Even if we did make the option only valid for these specific chips,
there will be an expectation that future chips will support it as
well, because someone will hack the driver and test it and it may work
for them and then there will be a push to add it for those chips too.

Alex

> >  Moreover, we've had a number of cases in the
> > past where users have under or overclocked and reported bugs or
> > stability issues and it did not come to light that they were doing
> > that until we'd already spent a good deal of time trying to debug the
> > issue.
>
> Taint the kernel when that module parameter is used? We iirc have a
> taint bit exactly for this sort of situation. Sure, such reports will
> still happen, but then you at least have an indicator to spot them.
>
> Ciao, Thorsten
>
> >  This obviously can still happen if you allow any sort of over
> > or underclocking, but at least if you stick to the limits you are
> > staying within the bounding box of the design.
> >
> > Alex
> >
> >> On 2/20/24 19:09, Alex Deucher wrote:
> >>> On Tue, Feb 20, 2024 at 11:46 AM Romano <romaniox@gmail.com> wrote:
> >>>> For Windows, apps like MSI Afterburner is the one to try and what most
> >>>> people go for. Using it in the past myself, I would be surprised if it
> >>>> adhered to such a high min power cap. But even if it did, why would we
> >>>> have to.
> >>>>
> >>>> Relying on vendors cap in this case has already proven wrong because
> >>>> things worked for quite some time already and people reported saving
> >>>> significant amount of watts, in my case 90W(!) for <10% perf.
> >>>>
> >>>> Therefore this talk about safety seems rather strange to me and
> >>>> especially so when we are talking about min_cap. Or name me a single
> >>>> case where someone fried his card due to "too low power" set in said
> >>>> variable. Now there was a report, where by going way too low, driver
> >>>> goes opposite into max power. That's it. That can be easily
> >>>> detected(vents going crazy etc.) and reverted. It is a max_cap that
> >>>> protect HW(also above scenario), not a min_cap. Feel free to adhere to
> >>>> safety standards with that one.
> >>> Because operation outside of the design bounding box is undefined.  It
> >>> might work for some boards but not others.  It's possible some of the
> >>> logic in the firmware or some of the components used on the board may
> >>> not work correctly below a certain limit, or the voltage regulators
> >>> used on a specific board have a minimum requirement that would not be
> >>> an issue if you stick the bounding box.
> >>>
> >>> Alex
> >>>
> >>>> As for solution, what some suggested already exist - a patch posted by
> >>>> fililip on gitlab is probably the way most of you would agree. It
> >>>> introduce a variable that can be set during boot to override min_cap.
> >>>> But he did not pull requested it, so please, if any one of you who have
> >>>> access to code and merge kernel would be kind enough to implement it.
> >>>>
> >>>>
> >>>>
> >>>> On 2/20/24 16:46, Alex Deucher wrote:
> >>>>> On Tue, Feb 20, 2024 at 10:42 AM Linux regression tracking (Thorsten
> >>>>> Leemhuis) <regressions@leemhuis.info> wrote:
> >>>>>>
> >>>>>> On 20.02.24 16:27, Hans de Goede wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On 2/20/24 16:15, Alex Deucher wrote:
> >>>>>>>> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
> >>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
> >>>>>>>>> On 20.02.24 15:45, Alex Deucher wrote:
> >>>>>>>>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
> >>>>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
> >>>>>>>>>>> On 17.02.24 14:30, Greg KH wrote:
> >>>>>>>>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
> >>>>>>>>>>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
> >>>>>>>>>>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
> >>>>>>>>>>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
> >>>>>>>>>>>>> even for root. This is not of above apps issue but of the kernel, I read
> >>>>>>>>>>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
> >>>>>>>>>>>>> kernel and my 115W(under power)cap work again as before.
> >>>>>>>>>>> For the record and everyone that lands here: the cause is known now
> >>>>>>>>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
> >>>>>>>>>>> value") [v6.7-rc1]) and the issue afaics tracked here:
> >>>>>>>>>>>
> >>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
> >>>>>>>>>>>
> >>>>>>>>>>> Other mentions:
> >>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
> >>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
> >>>>>>>>>>>
> >>>>>>>>>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
> >>>>>>>>>>> this there (but might have missed something!). From what I can see I
> >>>>>>>>>>> assume this will likely be somewhat tricky to handle, as a revert
> >>>>>>>>>>> overall might be a bad idea here. We'll see I guess.
> >>>>>>>>>> The change aligns the driver what has been validated on each board
> >>>>>>>>>> design.  Windows uses the same limits.  Using values lower than the
> >>>>>>>>>> validated range can lead to undefined behavior and could potentially
> >>>>>>>>>> damage your hardware.
> >>>>>>>>> Thx for the reply! Yeah, I was expecting something along those lines.
> >>>>>>>>>
> >>>>>>>>> Nevertheless it afaics still is a regression in the eyes of many users.
> >>>>>>>>> I'm not sure how Linus feels about this, but I wonder if we can find
> >>>>>>>>> some solution here so that users that really want to, can continue to do
> >>>>>>>>> what was possible out-of-the box before. Is that possible to realize or
> >>>>>>>>> even supported already?
> >>>>>>>>>
> >>>>>>>>> And sure, those users would be running their hardware outside of its
> >>>>>>>>> specifications. But is that different from overclocking (which the
> >>>>>>>>> driver allows, doesn't it? If not by all means please correct me!)?
> >>>>>>>> Sure.  The driver has always had upper bound limits for overclocking,
> >>>>>>>> this change adds lower bounds checking for underclocking as well.
> >>>>>>>> When the silicon validation teams set the bounding box for a device,
> >>>>>>>> they set a range of values where it's reasonable to operate based on
> >>>>>>>> the characteristics of the design.
> >>>>>>>>
> >>>>>>>> If we did want to allow extended underclocking, we need a big warning
> >>>>>>>> in the logs at the very least.
> >>>>>>> Requiring a module-option to be set to allow this, as well as a big
> >>>>>>> warning in the logs sounds like a good solution to me.
> >>>>>> Yeah, especially as it sounds from some of the reports as if some
> >>>>>> vendors did a really bad job when it came to setting the proper
> >>>>>> lower-bound limits are now adhered -- and thus higher then what we used
> >>>>>> out-of-the box before 1958946858a62b was applied.
> >>>>>>
> >>>>>> Side note: I assume those "lower bounds checking" is done round about
> >>>>>> the same way by the Windows driver? Does that one allow users to go
> >>>>>> lower somehow? Say after modifying the registry or something like that?
> >>>>>> Or through external tools?
> >>>>> Windows uses the same limit.  I'm not aware of any way to override the
> >>>>> limit on windows off hand.
> >>>>>
> >>>>> Alex
> >>>>>
> >>>>>
> >>>>>> Ciao, Thorsten
> >>>>>>
> >>>>>>>>>>> Roman posted something that apparently was meant to go to the list, so
> >>>>>>>>>>> let me put it here:
> >>>>>>>>>>>
> >>>>>>>>>>> """
> >>>>>>>>>>> UPDATE: User fililip already posted patch, but it need to be merged,
> >>>>>>>>>>> discussion is on gitlab link below.
> >>>>>>>>>>>
> >>>>>>>>>>> (PS: I hope I am replying correctly to "all" now? - using original addr.)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> it seems that commit was already found(see user's 'fililip' comment):
> >>>>>>>>>>>>
> >>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
> >>>>>>>>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39
> >>>>>>>>>>>> Author: Ma Jun <Jun.Ma2@amd.com>
> >>>>>>>>>>>> Date:   Thu Oct 12 09:33:45 2023 +0800
> >>>>>>>>>>>>
> >>>>>>>>>>>>       drm/amd/pm: Support for getting power1_cap_min value
> >>>>>>>>>>>>
> >>>>>>>>>>>>       Support for getting power1_cap_min value on smu13 and smu11.
> >>>>>>>>>>>>       For other Asics, we still use 0 as the default value.
> >>>>>>>>>>>>
> >>>>>>>>>>>>       Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
> >>>>>>>>>>>>       Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
> >>>>>>>>>>>>       Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> >>>>>>>>>>>>
> >>>>>>>>>>>> However, this is not good as it remove under-powering range too far. I
> >>>>>>>>>>> was getting only about 7% less performance but 90W(!) less consumption
> >>>>>>>>>>> when set to my 115W before. Also I wonder if we as a OS of options and
> >>>>>>>>>>> freedom have to stick to such very high reference for min values without
> >>>>>>>>>>> ability to override them through some sys ctrls. Commit was done by amd
> >>>>>>>>>>> guy and I wonder if because of maybe this post that I made few months
> >>>>>>>>>>> ago(business strategy?):
> >>>>>>>>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
> >>>>>>>>>>>> This is not a dangerous OC upwards where I can understand desire to
> >>>>>>>>>>> protect HW, it is downward, having min cap at 190W when card pull on
> >>>>>>>>>>> 115W almost same speed is IMO crazy to deny. We don't talk about default
> >>>>>>>>>>> or reference values here either, just a move to lower the range of
> >>>>>>>>>>> options for whatever reason.
> >>>>>>>>>>>> I don't know how much power you guys have over them, but please
> >>>>>>>>>>> consider either reverting this change, or give us an option to set
> >>>>>>>>>>> min_cap through say /sys (right now param is readonly, even for root).
> >>>>>>>>>>>> Thank you in advance for looking into this, with regards:  Romano
> >>>>>>>>>>> """
> >>>>>>>>>>>
> >>>>>>>>>>> And while at it, let me add this issue to the tracking as well
> >>>>>>>>>>>
> >>>>>>>>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
> >>>>>>>>>>> regressions; the text you find below is based on a few templates
> >>>>>>>>>>> paragraphs you might have encountered already in similar form.
> >>>>>>>>>>> See link in footer if these mails annoy you.]
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the report. To be sure the issue doesn't fall through the
> >>>>>>>>>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> >>>>>>>>>>> tracking bot:
> >>>>>>>>>>>
> >>>>>>>>>>> #regzbot introduced 1958946858a62b /
> >>>>>>>>>>> #regzbot title drm: amdgpu: under-powering broke
> >>>>>>>>>>>
> >>>>>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >>>>>>>>>>> --
> >>>>>>>>>>> Everything you wanna know about Linux kernel regression tracking:
> >>>>>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
> >>>>>>>>>>> That page also explains what to do if mails like this annoy you.
> >>>>>>>
> >
> >

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-21 15:15                           ` Christian König
@ 2024-02-21 15:44                             ` Thorsten Leemhuis
  2024-02-21 16:47                             ` Romano
  1 sibling, 0 replies; 28+ messages in thread
From: Thorsten Leemhuis @ 2024-02-21 15:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hans de Goede, Alex Deucher, Christian König, Pan, Xinhui,
	Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter, Greg KH,
	Christian König, Linux regressions mailing list,
	Alex Deucher, Romano

[+Linus, as we seem to have reached the point in the discussion about
this regression where that is likely for the best.

And just for the record: I'm *not* doing that because I'm disappointed,
angry, or something. I can relate to the point that was made in the mail
I'm replying to. It's just that this is a tricky situation due to the
"hardware might be damaged or work unreliably" aspect, so it's best if
we all know how Linus wants this to be handled.]

BTW, thread starts here:
https://lore.kernel.org/all/ae64f04d-6e94-4da4-a740-78ea94e0552c@riadoklan.sk.eu.org/

On 21.02.24 16:15, Christian König wrote:
> Am 21.02.24 um 07:06 schrieb Linux regression tracking (Thorsten Leemhuis):
>> On 20.02.24 21:18, Alex Deucher wrote:
>>> On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@gmail.com> wrote:
>>>> If the increased low range is allowed via boot option, like in proposed
>>>> patch, user clearly made an intentional decision. Undefined, but won't
>>>> fry his hardware for sure. Undefined is also overclocking in that
>>>> matter. You can go out of range with ratio of voltage vs
>>>> frequency(still
>>>> within vendor's limits) for example and crash the system.
>>> This whole thing reminds me of this:
>>> https://xkcd.com/1172/
>>> The problem is another module parameter is another interface to
>>> maintain and validate.
>> Yup, of course, all that is understood.
>>
>> But we have this "no regressions" rule for a reason. Adhering to it
>> strictly would afaics be counter-productive in this situation, but give
>> users some way to manually do what was possible before out-of-the box
>> IMHO is the minimum we should do.
>>
>> Maybe just allow that parameter only up to a certain recent GPU
>> generation, that way you won't have to deal with that at some point in
>> the future.
>>
>>>   Moreover, we've had a number of cases in the
>>> past where users have under or overclocked and reported bugs or
>>> stability issues and it did not come to light that they were doing
>>> that until we'd already spent a good deal of time trying to debug the
>>> issue.
>> Taint the kernel when that module parameter is used? We iirc have a
>> taint bit exactly for this sort of situation. Sure, such reports will
>> still happen, but then you at least have an indicator to spot them.
> 
> Let me recap what happened here:
> 
> 1. AMD is the GPU manufacturer, but apart from a few exceptions doesn't
> assemble boards.
> 
> 2. Vendors take AMD's GPUs and assemble them together with power
> regulators, memory and a bunch of other components into a PCIe board.
> 
> 3. AMD provides a vendor-agnostic driver, and for this to work vendors
> describe the min/max voltage their power regulators can handle in some
> flash memory.
> 
> 4. Hardware engineers point out that AMD's open source drivers are not
> respecting the min value.
> 
> 5. In response a patch was applied to respect that value and not use
> something outside of the hardware specification the vendor provided.
> 
> I'm not sure about it, but I think AMD needs to respect the min/max values
> simply by contract, and it's not really an option to not do that.
> 
> If someone really wants to run their hardware outside the vendor-
> recommended values, that person can still patch the driver to ignore the
> limits. It's just that then AMD is not responsible for any damage
> resulting from that.
> 
> So as far as I can see the request to make that a module option is a
> no-go, especially since hardware engineers have explicitly pointed out
> that we have to do this in the software stack.
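
For readers landing here, the effect of steps 3 to 5 in the recap above can
be modelled in a few lines. This is only a sketch in Python and not the
driver's code: the function name is made up for illustration, the 190 W
minimum and the 115 W request are the figures reported at the start of this
thread, and the 230 W maximum is purely an assumed placeholder. The values
are in microwatts because that is the unit hwmon uses for power1_cap,
power1_cap_min and power1_cap_max.

```python
# Illustrative model only; names and the 230 W maximum are assumptions,
# nothing here is taken from the amdgpu sources.
MICROWATTS_PER_WATT = 1_000_000


def check_power_cap(requested_uw, cap_min_uw, cap_max_uw):
    """Return the requested cap if it lies within [min, max], else raise."""
    if cap_min_uw > cap_max_uw:
        raise ValueError("inconsistent limits: min is above max")
    if not cap_min_uw <= requested_uw <= cap_max_uw:
        raise ValueError(
            "requested cap of %d uW is outside [%d, %d] uW"
            % (requested_uw, cap_min_uw, cap_max_uw)
        )
    return requested_uw


requested = 115 * MICROWATTS_PER_WATT
assumed_max = 230 * MICROWATTS_PER_WATT

# Lower bound of 0 (what the commit message says is still used for ASICs
# without a reported minimum): the 115 W request is accepted.
check_power_cap(requested, 0, assumed_max)

# Lower bound taken from the board data (190 W in the report): the same
# request is now rejected.
try:
    check_power_cap(requested, 190 * MICROWATTS_PER_WATT, assumed_max)
except ValueError as err:
    print("rejected:", err)
```

The sketch rejects an out-of-range request instead of silently clamping it,
which matches what the tools mentioned in the report appear to run into; a
module option as discussed in this thread would amount to choosing which of
the two lower bounds gets passed in.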

As mentioned above: I can relate to that point of view. But in the end
this is the kernel and "no regressions" is something that is considered
the #1 rule in the development process and especially so by Linus
himself. So let's see if he has something to say here. If he doesn't
reply I'll rest my case. :-D

Ciao, Thorsten


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-21 15:39                           ` Alex Deucher
@ 2024-02-21 15:53                             ` Linux regression tracking (Thorsten Leemhuis)
  2024-03-04 14:12                               ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-21 17:45                             ` Romano
  2024-02-26 13:04                             ` Daniel Vetter
  2 siblings, 1 reply; 28+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-21 15:53 UTC (permalink / raw)
  To: Alex Deucher, Linux regressions mailing list
  Cc: Romano, Hans de Goede, Alex Deucher, Christian König, Pan,
	Xinhui, Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter, Greg KH

On 21.02.24 16:39, Alex Deucher wrote:
> On Wed, Feb 21, 2024 at 1:06 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
>>
>> On 20.02.24 21:18, Alex Deucher wrote:
>>> On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@gmail.com> wrote:
>>>>
>>>> If the increased low range is allowed via boot option, like in proposed
>>>> patch, user clearly made an intentional decision. Undefined, but won't
>>>> fry his hardware for sure. Undefined is also overclocking in that
>>>> matter. You can go out of range with ratio of voltage vs frequency(still
>>>> within vendor's limits) for example and crash the system.
>>>
>>> This whole thing reminds me of this:
>>> https://xkcd.com/1172/
>>> The problem is another module parameter is another interface to
>>> maintain and validate.
>>
>> Yup, of course, all that is understood.
>>
>> But we have this "no regressions" rule for a reason. Adhering to it
>> strictly would afaics be counter-productive in this situation, but give
>> users some way to manually do what was possible before out-of-the box
>> IMHO is the minimum we should do.
>>
>> Maybe just allow that parameter only up to a certain recent GPU
>> generation, that way you won't have to deal with that at some point in
>> the future.
> 
> The problem is the cumulative effect of all of these parameters.
> Every time there is some change in the driver someone disagrees with
> there is a push to add a module parameter for it.  The driver already
> has too many module parameters and it's hard to keep them all working
> consistently and in every possible combination.  Moreover, the module
> options are supposed to be mainly for debugging.  The driver sets
> proper defaults for all chips to ensure proper operation, however lots
> of random forums seem to treat them like they are the recipe for some
> special sauce so users are constantly setting various combinations of
> them because they read somewhere on a forum that it would make their
> GPU run faster.  More often than not this leads to problems.
> 
> Even if we did make the option only valid for these specific chips,
> there will be an expectation that future chips will support it as
> well, because someone will hack the driver and test it and it may work
> for them and then there will be a push to add it for those chips too.

I know, I fully understand this. Sorry for being a PITA. I'm just
arguing for a parameter because I think that's what I should do in this
situation due to the regression aspect and our #1 rule.
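
On the taint idea raised earlier in the thread: the kernel exposes its
current taint state as a bitmask in /proc/sys/kernel/tainted, so a report
from a tainted system would be easy to spot. Below is a minimal sketch in
Python that decodes such a mask. The function names are hypothetical, only
a handful of the documented bits are listed, and which bit (if any) the
driver would set for an out-of-range power cap is not settled in this
thread.

```python
# Subset of the taint bits documented in the kernel's "Tainted kernels"
# admin guide; the selection is for illustration only.
TAINT_FLAGS = {
    0: "P: proprietary module was loaded",
    2: "S: kernel running on an out of specification system",
    6: "U: taint requested by a userspace application",
    9: "W: kernel issued a warning",
    12: "O: externally-built (out-of-tree) module was loaded",
}


def decode_taint(mask):
    """Return descriptions of the known taint bits set in mask."""
    return [
        description
        for bit, description in sorted(TAINT_FLAGS.items())
        if mask & (1 << bit)
    ]


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the taint bitmask of the running kernel (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    for line in decode_taint(read_taint()):
        print(line)
```

A bug triager could run this (or simply look at the taint letters in an
oops) before spending time on a report, which is the kind of indicator the
taint suggestion is after.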

Ciao, Thorsten

>>>  Moreover, we've had a number of cases in the
>>> past where users have under or overclocked and reported bugs or
>>> stability issues and it did not come to light that they were doing
>>> that until we'd already spent a good deal of time trying to debug the
>>> issue.
>>
>> Taint the kernel when that module parameter is used? We iirc have a
>> taint bit exactly for this sort of situation. Sure, such reports will
>> still happen, but then you at least have an indicator to spot them.
>>
>> Ciao, Thorsten
>>
>>>  This obviously can still happen if you allow any sort of over
>>> or underclocking, but at least if you stick to the limits you are
>>> staying within the bounding box of the design.
>>>
>>> Alex
>>>
>>>> On 2/20/24 19:09, Alex Deucher wrote:
>>>>> On Tue, Feb 20, 2024 at 11:46 AM Romano <romaniox@gmail.com> wrote:
>>>>>> For Windows, apps like MSI Afterburner is the one to try and what most
>>>>>> people go for. Using it in the past myself, I would be surprised if it
>>>>>> adhered to such a high min power cap. But even if it did, why would we
>>>>>> have to.
>>>>>>
>>>>>> Relying on vendors cap in this case has already proven wrong because
>>>>>> things worked for quite some time already and people reported saving
>>>>>> significant amount of watts, in my case 90W(!) for <10% perf.
>>>>>>
>>>>>> Therefore this talk about safety seems rather strange to me and
>>>>>> especially so when we are talking about min_cap. Or name me a single
>>>>>> case where someone fried his card due to "too low power" set in said
>>>>>> variable. Now there was a report, where by going way too low, driver
>>>>>> goes opposite into max power. That's it. That can be easily
>>>>>> detected(vents going crazy etc.) and reverted. It is a max_cap that
>>>>>> protect HW(also above scenario), not a min_cap. Feel free to adhere to
>>>>>> safety standards with that one.
>>>>> Because operation outside of the design bounding box is undefined.  It
>>>>> might work for some boards but not others.  It's possible some of the
>>>>> logic in the firmware or some of the components used on the board may
>>>>> not work correctly below a certain limit, or the voltage regulators
>>>>> used on a specific board have a minimum requirement that would not be
>>>>> an issue if you stick the bounding box.
>>>>>
>>>>> Alex
>>>>>
>>>>>> As for solution, what some suggested already exist - a patch posted by
>>>>>> fililip on gitlab is probably the way most of you would agree. It
>>>>>> introduce a variable that can be set during boot to override min_cap.
>>>>>> But he did not pull requested it, so please, if any one of you who have
>>>>>> access to code and merge kernel would be kind enough to implement it.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2/20/24 16:46, Alex Deucher wrote:
>>>>>>> On Tue, Feb 20, 2024 at 10:42 AM Linux regression tracking (Thorsten
>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>>>
>>>>>>>> On 20.02.24 16:27, Hans de Goede wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 2/20/24 16:15, Alex Deucher wrote:
>>>>>>>>>> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
>>>>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>>>>>> On 20.02.24 15:45, Alex Deucher wrote:
>>>>>>>>>>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
>>>>>>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>>>>>>>> On 17.02.24 14:30, Greg KH wrote:
>>>>>>>>>>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
>>>>>>>>>>>>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
>>>>>>>>>>>>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
>>>>>>>>>>>>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
>>>>>>>>>>>>>>> even for root. This is not of above apps issue but of the kernel, I read
>>>>>>>>>>>>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
>>>>>>>>>>>>>>> kernel and my 115W(under power)cap work again as before.
>>>>>>>>>>>>> For the record and everyone that lands here: the cause is known now
>>>>>>>>>>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
>>>>>>>>>>>>> value") [v6.7-rc1]) and the issue afaics tracked here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>>>>>>>>>
>>>>>>>>>>>>> Other mentions:
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
>>>>>>>>>>>>>
>>>>>>>>>>>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
>>>>>>>>>>>>> this there (but might have missed something!). From what I can see I
>>>>>>>>>>>>> assume this will likely be somewhat tricky to handle, as a revert
>>>>>>>>>>>>> overall might be a bad idea here. We'll see I guess.
>>>>>>>>>>>> The change aligns the driver what has been validated on each board
>>>>>>>>>>>> design.  Windows uses the same limits.  Using values lower than the
>>>>>>>>>>>> validated range can lead to undefined behavior and could potentially
>>>>>>>>>>>> damage your hardware.
>>>>>>>>>>> Thx for the reply! Yeah, I was expecting something along those lines.
>>>>>>>>>>>
>>>>>>>>>>> Nevertheless it afaics still is a regression in the eyes of many users.
>>>>>>>>>>> I'm not sure how Linus feels about this, but I wonder if we can find
>>>>>>>>>>> some solution here so that users that really want to, can continue to do
>>>>>>>>>>> what was possible out-of-the box before. Is that possible to realize or
>>>>>>>>>>> even supported already?
>>>>>>>>>>>
>>>>>>>>>>> And sure, those users would be running their hardware outside of its
>>>>>>>>>>> specifications. But is that different from overclocking (which the
>>>>>>>>>>> driver allows, doesn't it? If not by all means please correct me!)?
>>>>>>>>>> Sure.  The driver has always had upper bound limits for overclocking,
>>>>>>>>>> this change adds lower bounds checking for underclocking as well.
>>>>>>>>>> When the silicon validation teams set the bounding box for a device,
>>>>>>>>>> they set a range of values where it's reasonable to operate based on
>>>>>>>>>> the characteristics of the design.
>>>>>>>>>>
>>>>>>>>>> If we did want to allow extended underclocking, we need a big warning
>>>>>>>>>> in the logs at the very least.
>>>>>>>>> Requiring a module-option to be set to allow this, as well as a big
>>>>>>>>> warning in the logs sounds like a good solution to me.
>>>>>>>> Yeah, especially as it sounds from some of the reports as if some
>>>>>>>> vendors did a really bad job when it came to setting the proper
>>>>>>>> lower-bound limits are now adhered -- and thus higher then what we used
>>>>>>>> out-of-the box before 1958946858a62b was applied.
>>>>>>>>
>>>>>>>> Side note: I assume those "lower bounds checking" is done round about
>>>>>>>> the same way by the Windows driver? Does that one allow users to go
>>>>>>>> lower somehow? Say after modifying the registry or something like that?
>>>>>>>> Or through external tools?
>>>>>>> Windows uses the same limit.  I'm not aware of any way to override the
>>>>>>> limit on windows off hand.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>
>>>>>>>> Ciao, Thorsten
>>>>>>>>
>>>>>>>>>>>>> Roman posted something that apparently was meant to go to the list, so
>>>>>>>>>>>>> let me put it here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> """
>>>>>>>>>>>>> UPDATE: User fililip already posted patch, but it need to be merged,
>>>>>>>>>>>>> discussion is on gitlab link below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (PS: I hope I am replying correctly to "all" now? - using original addr.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> it seems that commit was already found(see user's 'fililip' comment):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>>>>>>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39
>>>>>>>>>>>>>> Author: Ma Jun <Jun.Ma2@amd.com>
>>>>>>>>>>>>>> Date:   Thu Oct 12 09:33:45 2023 +0800
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       drm/amd/pm: Support for getting power1_cap_min value
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       Support for getting power1_cap_min value on smu13 and smu11.
>>>>>>>>>>>>>>       For other Asics, we still use 0 as the default value.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
>>>>>>>>>>>>>>       Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
>>>>>>>>>>>>>>       Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, this is not good as it remove under-powering range too far. I
>>>>>>>>>>>>> was getting only about 7% less performance but 90W(!) less consumption
>>>>>>>>>>>>> when set to my 115W before. Also I wonder if we as a OS of options and
>>>>>>>>>>>>> freedom have to stick to such very high reference for min values without
>>>>>>>>>>>>> ability to override them through some sys ctrls. Commit was done by amd
>>>>>>>>>>>>> guy and I wonder if because of maybe this post that I made few months
>>>>>>>>>>>>> ago(business strategy?):
>>>>>>>>>>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
>>>>>>>>>>>>>> This is not a dangerous OC upwards where I can understand desire to
>>>>>>>>>>>>> protect HW, it is downward, having min cap at 190W when card pull on
>>>>>>>>>>>>> 115W almost same speed is IMO crazy to deny. We don't talk about default
>>>>>>>>>>>>> or reference values here either, just a move to lower the range of
>>>>>>>>>>>>> options for whatever reason.
>>>>>>>>>>>>>> I don't know how much power you guys have over them, but please
>>>>>>>>>>>>> consider either reverting this change, or give us an option to set
>>>>>>>>>>>>> min_cap through say /sys (right now param is readonly, even for root).
>>>>>>>>>>>>>> Thank you in advance for looking into this, with regards:  Romano
>>>>>>>>>>>>> """
>>>>>>>>>>>>>
>>>>>>>>>>>>> And while at it, let me add this issue to the tracking as well
>>>>>>>>>>>>>
>>>>>>>>>>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>>>>>>>>>>>> regressions; the text you find below is based on a few templates
>>>>>>>>>>>>> paragraphs you might have encountered already in similar form.
>>>>>>>>>>>>> See link in footer if these mails annoy you.]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the report. To be sure the issue doesn't fall through the
>>>>>>>>>>>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>>>>>>>>>>>>> tracking bot:
>>>>>>>>>>>>>
>>>>>>>>>>>>> #regzbot introduced 1958946858a62b /
>>>>>>>>>>>>> #regzbot title drm: amdgpu: under-powering broke
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>>>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>>>>>>>> That page also explains what to do if mails like this annoy you.
>>>>>>>>>
>>>
>>>
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-21 15:15                           ` Christian König
  2024-02-21 15:44                             ` Thorsten Leemhuis
@ 2024-02-21 16:47                             ` Romano
  1 sibling, 0 replies; 28+ messages in thread
From: Romano @ 2024-02-21 16:47 UTC (permalink / raw)
  To: Christian König, Linux regressions mailing list, Alex Deucher
  Cc: Hans de Goede, Alex Deucher, Christian König, Pan, Xinhui,
	Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter, Greg KH

So that's what it's about. Somehow I knew it all along. Not long ago, I
posted this on reddit:

https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/

That was 3 months ago. Now suddenly AMD *requires* ("..hardware engineers
have explicitly pointed out that we *have to* do this in the software
stack", "..open source drivers are *not respecting* the min value") that
you enforce min_cap strictly. What a coincidence, after an unlimited
min_cap was fine for years. What is the statistical probability that just
3 months after someone makes it public, they patch the driver to "respect
firmware" and the thing is suddenly "out of spec"? I don't know.

Maybe someone should remind them that we do not work for them and we do
not care about their marketing strategies. What is it to them if a user
can override the kernel value? The card literally cannot die, I cannot
fry it; I am not going over the peak allowed power but the opposite.
Whether I have to do it via a boot option or by patching the kernel
should not be up to them.

By all means set min_cap to the specification, but do allow an override.
That is all we ask; there are many affected users, and more issues about
this are open on gitlab than just mine.

Also, the points about vendors knowing and defining the range due to HW
components... please. How come they haven't noticed such huge savings
when the HW is clearly operating safely? To me this is intentional and
has to do with how badly future HW may sell, like an arbitrary hold-back
to milk money (kind of what Intel did for many CPU generations). They
don't have much room left for die shrinks, and other optimizations have
also been made over decades. Things are saturated everywhere. You can
show nice upgrade charts on how the next gen beats the 6700XT if it draws
~200W, but can the next gen beat it by enough to warrant a purchase if my
6700XT gets almost the same perf. on **80% less** power? That efficiency
ratio is so blown out it leaves a few jaws on the floor and makes 2-3
generations obsolete. No wonder they have to fake specs. I don't believe
a word they say; the cards operate safely and this could not have been
missed.




On 2/21/24 16:15, Christian König wrote:
> Am 21.02.24 um 07:06 schrieb Linux regression tracking (Thorsten 
> Leemhuis):
>> On 20.02.24 21:18, Alex Deucher wrote:
>>> On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@gmail.com> wrote:
>>>> If the increased low range is allowed via boot option, like in 
>>>> proposed
>>>> patch, user clearly made an intentional decision. Undefined, but won't
>>>> fry his hardware for sure. Undefined is also overclocking in that
>>>> matter. You can go out of range with ratio of voltage vs 
>>>> frequency(still
>>>> within vendor's limits) for example and crash the system.
>>> This whole thing reminds me of this:
>>> https://xkcd.com/1172/
>>> The problem is another module parameter is another interface to
>>> maintain and validate.
>> Yup, of course, all that is understood.
>>
>> But we have this "no regressions" rule for a reason. Adhering to it
>> strictly would afaics be counter-productive in this situation, but give
>> users some way to manually do what was possible before out-of-the box
>> IMHO is the minimum we should do.
>>
>> Maybe just allow that parameter only up to a certain recent GPU
>> generation, that way you won't have to deal with that at some point in
>> the future.
>>
>>>   Moreover, we've had a number of cases in the
>>> past where users have under or overclocked and reported bugs or
>>> stability issues and it did not come to light that they were doing
>>> that until we'd already spent a good deal of time trying to debug the
>>> issue.
>> Taint the kernel when that module parameter is used? We iirc have a
>> taint bit exactly for this sort of situation. Sure, such reports will
>> still happen, but then you at least have an indicator to spot them.
>
> Let me recap what happened here:
>
> 1. AMD is the GPU manufacturer, but apart from a few exceptions 
> doesn't assemble boards.
>
> 2. Vendors take AMDs GPUs and assemble them together with power 
> regulators, memory and a bunch of other components into PCIe board.
>
> 3. AMD provides a vendor agnostic driver and for this to work vendors 
> describe to the min/max voltage their power regulators can do in some 
> flash memory.
>
> 4. Hardware engineers point out that AMDs open source drivers are not 
> respecting the min value.
>
> 5. In response a patch was applied to respect that value and not use 
> something outside of the hardware specification the vendor provided.
>
> I'm not sure about it but I think AMD need to respect the min/max 
> values simply by contract and it's not really an option to not do that.
>
> If someone really want to run your hardware outside the vendor 
> recommended values that person can still patch the driver to ignore 
> the limits. It's just that then AMD is not responsible for any damage 
> resulting from that.
>
> So as far as I can see the request to make that a module option is a 
> no-go, especially since hardware engineers have explicitly pointed out 
> that we have to do this in the software stack.
>
> Regards,
> Christian.
>
>>
>> Ciao, Thorsten
>>
>>>   This obviously can still happen if you allow any sort of over
>>> or underclocking, but at least if you stick to the limits you are
>>> staying within the bounding box of the design.
>>>
>>> Alex
>>>
>>>> On 2/20/24 19:09, Alex Deucher wrote:
>>>>> On Tue, Feb 20, 2024 at 11:46 AM Romano <romaniox@gmail.com> wrote:
>>>>>> For Windows, apps like MSI Afterburner is the one to try and what 
>>>>>> most
>>>>>> people go for. Using it in the past myself, I would be surprised 
>>>>>> if it
>>>>>> adhered to such a high min power cap. But even if it did, why 
>>>>>> would we
>>>>>> have to.
>>>>>>
>>>>>> Relying on vendors cap in this case has already proven wrong because
>>>>>> things worked for quite some time already and people reported saving
>>>>>> significant amount of watts, in my case 90W(!) for <10% perf.
>>>>>>
>>>>>> Therefore this talk about safety seems rather strange to me and
>>>>>> especially so when we are talking about min_cap. Or name me a single
>>>>>> case where someone fried his card due to "too low power" set in said
>>>>>> variable. Now there was a report, where by going way too low, driver
>>>>>> goes opposite into max power. That's it. That can be easily
>>>>>> detected(vents going crazy etc.) and reverted. It is a max_cap that
>>>>>> protect HW(also above scenario), not a min_cap. Feel free to 
>>>>>> adhere to
>>>>>> safety standards with that one.
>>>>> Because operation outside of the design bounding box is 
>>>>> undefined.  It
>>>>> might work for some boards but not others.  It's possible some of the
>>>>> logic in the firmware or some of the components used on the board may
>>>>> not work correctly below a certain limit, or the voltage regulators
>>>>> used on a specific board have a minimum requirement that would not be
>>>>> an issue if you stick the bounding box.
>>>>>
>>>>> Alex
>>>>>
>>>>>> As for solution, what some suggested already exist - a patch 
>>>>>> posted by
>>>>>> fililip on gitlab is probably the way most of you would agree. It
>>>>>> introduce a variable that can be set during boot to override 
>>>>>> min_cap.
>>>>>> But he did not pull requested it, so please, if any one of you 
>>>>>> who have
>>>>>> access to code and merge kernel would be kind enough to implement 
>>>>>> it.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2/20/24 16:46, Alex Deucher wrote:
>>>>>>> On Tue, Feb 20, 2024 at 10:42 AM Linux regression tracking 
>>>>>>> (Thorsten
>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>>> On 20.02.24 16:27, Hans de Goede wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 2/20/24 16:15, Alex Deucher wrote:
>>>>>>>>>> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking 
>>>>>>>>>> (Thorsten
>>>>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>>>>>> On 20.02.24 15:45, Alex Deucher wrote:
>>>>>>>>>>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking 
>>>>>>>>>>>> (Thorsten
>>>>>>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>>>>>>>> On 17.02.24 14:30, Greg KH wrote:
>>>>>>>>>>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
>>>>>>>>>>>>>>> Minimum power limit on latest(6.7+) kernels is 190W for 
>>>>>>>>>>>>>>> my GPU (RX 6700XT,
>>>>>>>>>>>>>>> mesa, archlinux) and I cannot get power cap as low as 
>>>>>>>>>>>>>>> before(to 115W),
>>>>>>>>>>>>>>> neither with Corectrl, LACT or TuxClocker and /sys have 
>>>>>>>>>>>>>>> a variable read-only
>>>>>>>>>>>>>>> even for root. This is not of above apps issue but of 
>>>>>>>>>>>>>>> the kernel, I read
>>>>>>>>>>>>>>> similar issues from other bug reports of above apps. I 
>>>>>>>>>>>>>>> downgraded to v6.6.10
>>>>>>>>>>>>>>> kernel and my 115W(under power)cap work again as before.
>>>>>>>>>>>>> For the record and everyone that lands here: the cause is 
>>>>>>>>>>>>> known now
>>>>>>>>>>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting 
>>>>>>>>>>>>> power1_cap_min
>>>>>>>>>>>>> value") [v6.7-rc1]) and the issue afaics tracked here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>>>>>>>>>
>>>>>>>>>>>>> Other mentions:
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
>>>>>>>>>>>>>
>>>>>>>>>>>>> Haven't seen any statement from the amdgpu developers (now 
>>>>>>>>>>>>> CCed) yet on
>>>>>>>>>>>>> this there (but might have missed something!). From what I 
>>>>>>>>>>>>> can see I
>>>>>>>>>>>>> assume this will likely be somewhat tricky to handle, as a 
>>>>>>>>>>>>> revert
>>>>>>>>>>>>> overall might be a bad idea here. We'll see I guess.
>>>>>>>>>>>> The change aligns the driver what has been validated on 
>>>>>>>>>>>> each board
>>>>>>>>>>>> design.  Windows uses the same limits. Using values lower 
>>>>>>>>>>>> than the
>>>>>>>>>>>> validated range can lead to undefined behavior and could 
>>>>>>>>>>>> potentially
>>>>>>>>>>>> damage your hardware.
>>>>>>>>>>> Thx for the reply! Yeah, I was expecting something along 
>>>>>>>>>>> those lines.
>>>>>>>>>>>
>>>>>>>>>>> Nevertheless it afaics still is a regression in the eyes of 
>>>>>>>>>>> many users.
>>>>>>>>>>> I'm not sure how Linus feels about this, but I wonder if we 
>>>>>>>>>>> can find
>>>>>>>>>>> some solution here so that users that really want to, can 
>>>>>>>>>>> continue to do
>>>>>>>>>>> what was possible out-of-the box before. Is that possible to 
>>>>>>>>>>> realize or
>>>>>>>>>>> even supported already?
>>>>>>>>>>>
>>>>>>>>>>> And sure, those users would be running their hardware 
>>>>>>>>>>> outside of its
>>>>>>>>>>> specifications. But is that different from overclocking 
>>>>>>>>>>> (which the
>>>>>>>>>>> driver allows, doesn't it? If not by all means please 
>>>>>>>>>>> correct me!)?
>>>>>>>>>> Sure.  The driver has always had upper bound limits for 
>>>>>>>>>> overclocking,
>>>>>>>>>> this change adds lower bounds checking for underclocking as 
>>>>>>>>>> well.
>>>>>>>>>> When the silicon validation teams set the bounding box for a 
>>>>>>>>>> device,
>>>>>>>>>> they set a range of values where it's reasonable to operate 
>>>>>>>>>> based on
>>>>>>>>>> the characteristics of the design.
>>>>>>>>>>
>>>>>>>>>> If we did want to allow extended underclocking, we need a big 
>>>>>>>>>> warning
>>>>>>>>>> in the logs at the very least.
>>>>>>>>> Requiring a module-option to be set to allow this, as well as 
>>>>>>>>> a big
>>>>>>>>> warning in the logs sounds like a good solution to me.
>>>>>>>> Yeah, especially as it sounds from some of the reports as if some
>>>>>>>> vendors did a really bad job when it came to setting the proper
>>>>>>>> lower-bound limits are now adhered -- and thus higher then what 
>>>>>>>> we used
>>>>>>>> out-of-the box before 1958946858a62b was applied.
>>>>>>>>
>>>>>>>> Side note: I assume those "lower bounds checking" is done round 
>>>>>>>> about
>>>>>>>> the same way by the Windows driver? Does that one allow users 
>>>>>>>> to go
>>>>>>>> lower somehow? Say after modifying the registry or something 
>>>>>>>> like that?
>>>>>>>> Or through external tools?
>>>>>>> Windows uses the same limit.  I'm not aware of any way to 
>>>>>>> override the
>>>>>>> limit on windows off hand.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>
>>>>>>>> Ciao, Thorsten
>>>>>>>>
>>>>>>>>>>>>> Roman posted something that apparently was meant to go to 
>>>>>>>>>>>>> the list, so
>>>>>>>>>>>>> let me put it here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> """
>>>>>>>>>>>>> UPDATE: User fililip already posted patch, but it need to 
>>>>>>>>>>>>> be merged,
>>>>>>>>>>>>> discussion is on gitlab link below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (PS: I hope I am replying correctly to "all" now? - using 
>>>>>>>>>>>>> original addr.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> it seems that commit was already found(see user's 
>>>>>>>>>>>>>> 'fililip' comment):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>>>>>>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39
>>>>>>>>>>>>>> Author: Ma Jun <Jun.Ma2@amd.com>
>>>>>>>>>>>>>> Date:   Thu Oct 12 09:33:45 2023 +0800
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        drm/amd/pm: Support for getting power1_cap_min value
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        Support for getting power1_cap_min value on smu13 
>>>>>>>>>>>>>> and smu11.
>>>>>>>>>>>>>>        For other Asics, we still use 0 as the default value.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
>>>>>>>>>>>>>>        Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
>>>>>>>>>>>>>>        Signed-off-by: Alex Deucher 
>>>>>>>>>>>>>> <alexander.deucher@amd.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, this is not good as it remove under-powering 
>>>>>>>>>>>>>> range too far. I
>>>>>>>>>>>>> was getting only about 7% less performance but 90W(!) less 
>>>>>>>>>>>>> consumption
>>>>>>>>>>>>> when set to my 115W before. Also I wonder if we as a OS of 
>>>>>>>>>>>>> options and
>>>>>>>>>>>>> freedom have to stick to such very high reference for min 
>>>>>>>>>>>>> values without
>>>>>>>>>>>>> ability to override them through some sys ctrls. Commit 
>>>>>>>>>>>>> was done by amd
>>>>>>>>>>>>> guy and I wonder if because of maybe this post that I made 
>>>>>>>>>>>>> few months
>>>>>>>>>>>>> ago(business strategy?):
>>>>>>>>>>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/ 
>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is not a dangerous OC upwards where I can understand 
>>>>>>>>>>>>>> desire to
>>>>>>>>>>>>> protect HW, it is downward, having min cap at 190W when 
>>>>>>>>>>>>> card pull on
>>>>>>>>>>>>> 115W almost same speed is IMO crazy to deny. We don't talk 
>>>>>>>>>>>>> about default
>>>>>>>>>>>>> or reference values here either, just a move to lower the 
>>>>>>>>>>>>> range of
>>>>>>>>>>>>> options for whatever reason.
>>>>>>>>>>>>>> I don't know how much power you guys have over them, but 
>>>>>>>>>>>>>> please
>>>>>>>>>>>>> consider either reverting this change, or give us an 
>>>>>>>>>>>>> option to set
>>>>>>>>>>>>> min_cap through say /sys (right now param is readonly, 
>>>>>>>>>>>>> even for root).
>>>>>>>>>>>>>> Thank you in advance for looking into this, with 
>>>>>>>>>>>>>> regards:  Romano
>>>>>>>>>>>>> """
>>>>>>>>>>>>>
>>>>>>>>>>>>> And while at it, let me add this issue to the tracking as 
>>>>>>>>>>>>> well
>>>>>>>>>>>>>
>>>>>>>>>>>>> [TLDR: I'm adding this report to the list of tracked Linux 
>>>>>>>>>>>>> kernel
>>>>>>>>>>>>> regressions; the text you find below is based on a few 
>>>>>>>>>>>>> templates
>>>>>>>>>>>>> paragraphs you might have encountered already in similar 
>>>>>>>>>>>>> form.
>>>>>>>>>>>>> See link in footer if these mails annoy you.]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the report. To be sure the issue doesn't fall 
>>>>>>>>>>>>> through the
>>>>>>>>>>>>> cracks unnoticed, I'm adding it to regzbot, the Linux 
>>>>>>>>>>>>> kernel regression
>>>>>>>>>>>>> tracking bot:
>>>>>>>>>>>>>
>>>>>>>>>>>>> #regzbot introduced 1958946858a62b /
>>>>>>>>>>>>> #regzbot title drm: amdgpu: under-powering broke
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression 
>>>>>>>>>>>>> tracker' hat)
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Everything you wanna know about Linux kernel regression 
>>>>>>>>>>>>> tracking:
>>>>>>>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>>>>>>>> That page also explains what to do if mails like this 
>>>>>>>>>>>>> annoy you.
>>>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-21 15:39                           ` Alex Deucher
  2024-02-21 15:53                             ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-21 17:45                             ` Romano
  2024-02-26 13:04                             ` Daniel Vetter
  2 siblings, 0 replies; 28+ messages in thread
From: Romano @ 2024-02-21 17:45 UTC (permalink / raw)
  To: Alex Deucher, Linux regressions mailing list
  Cc: Hans de Goede, Alex Deucher, Christian König, Pan, Xinhui,
	Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter, Greg KH

Here is my proposal:

On boot, read chip values into min_cap, default_cap, max_cap and set 
them, satisfying AMD's requirement.

Do not introduce any new boot flags, keeping things simple.

Keep default_cap and max_cap read-only to protect HW.

Make min_cap readwrite: "echo 1234 > /sys/...min_cap".

No limitation to specific HW, just a general mechanism.

This way you have done your job of fulfilling AMD's request. If the user
changes the min, it was his intention anyway; the HW is safe with this
option. You won't see any false bug reports because this does not
introduce instability, unlike OC.
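
For reference, a minimal userspace sketch of how the caps above look to
tools today. It assumes the amdgpu hwmon attributes power1_cap,
power1_cap_min and power1_cap_max, with values in microwatts. The helper
names and the hwmon directory path are illustrative assumptions, not part
of any proposed patch. With a writable min_cap as proposed, only the lower
bound read here would change.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def read_microwatts(hwmon_dir: Path, name: str) -> int:
    """Read one integer hwmon attribute (assumed to be in microwatts)."""
    return int((hwmon_dir / name).read_text().strip())


def set_power_cap(hwmon_dir: Path, watts: int) -> int:
    """Write a power cap given in watts; refuse values outside [min, max]."""
    lo = read_microwatts(hwmon_dir, "power1_cap_min")
    hi = read_microwatts(hwmon_dir, "power1_cap_max")
    requested = watts * MICROWATTS_PER_WATT
    if not lo <= requested <= hi:
        raise ValueError(
            f"{watts} W is outside the reported range "
            f"{lo // MICROWATTS_PER_WATT}-{hi // MICROWATTS_PER_WATT} W"
        )
    # Writing the real sysfs file needs root.
    (hwmon_dir / "power1_cap").write_text(f"{requested}\n")
    return requested
```

On a real system the directory would be something like
/sys/class/drm/card0/device/hwmon/hwmonN (assumption: card and hwmon
numbers differ per machine). With a reported minimum of 190 W,
set_power_cap(path, 115) raises ValueError, which is the behaviour this
thread is about.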

As a side note, it seems Windows does not allow going lower than the
vendor's min either, even via Afterburner. This may seem like "you see,
they too follow specs", but I see it as a positive for Linux. We already
have mesa, which is on par with, if not better than, the Nvidia driver.
It is generally known that AMD is great on Linux. Now if on top of that
Windows users find that they can get not only better, limitless drivers
but also significant, off-the-charts efficiency and power savings, this
makes Linux only more attractive to Windows users and makes adoption
faster.



On 2/21/24 16:39, Alex Deucher wrote:
> On Wed, Feb 21, 2024 at 1:06 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
>> On 20.02.24 21:18, Alex Deucher wrote:
>>> On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@gmail.com> wrote:
>>>> If the increased low range is allowed via boot option, like in proposed
>>>> patch, user clearly made an intentional decision. Undefined, but won't
>>>> fry his hardware for sure. Undefined is also overclocking in that
>>>> matter. You can go out of range with ratio of voltage vs frequency(still
>>>> within vendor's limits) for example and crash the system.
>>> This whole thing reminds me of this:
>>> https://xkcd.com/1172/
>>> The problem is another module parameter is another interface to
>>> maintain and validate.
>> Yup, of course, all that is understood.
>>
>> But we have this "no regressions" rule for a reason. Adhering to it
>> strictly would afaics be counter-productive in this situation, but give
>> users some way to manually do what was possible before out-of-the box
>> IMHO is the minimum we should do.
>>
>> Maybe just allow that parameter only up to a certain recent GPU
>> generation, that way you won't have to deal with that at some point in
>> the future.
> The problem is the cumulative effect of all of these parameters.
> Every time there is some change in the driver someone disagrees with
> there is a push to add a module parameter for it.  The driver already
> has too many module parameters and it's hard to keep them all working
> consistently and in every possible combination.  Moreover, the module
> options are supposed to be mainly for debugging.  The driver sets
> proper defaults for all chips to ensure proper operation, however lots
> of random forums seem to treat them like they are the recipe for some
> special sauce so users are constantly setting various combinations of
> them because they read somewhere on a forum that it would make their
> GPU run faster.  More often than not this leads to problems.
>
> Even if we did make the option only valid for these specific chips,
> there will be an expectation that future chips will support it as
> well, because someone will hack the driver and test it and it may work
> for them and then there will be a push to add it for those chips too.
>
> Alex
>
>>>   Moreover, we've had a number of cases in the
>>> past where users have under or overclocked and reported bugs or
>>> stability issues and it did not come to light that they were doing
>>> that until we'd already spent a good deal of time trying to debug the
>>> issue.
>> Taint the kernel when that module parameter is used? We iirc have a
>> taint bit exactly for this sort of situation. Sure, such reports will
>> still happen, but then you at least have an indicator to spot them.
>>
>> Ciao, Thorsten
>>
>>>   This obviously can still happen if you allow any sort of over
>>> or underclocking, but at least if you stick to the limits you are
>>> staying within the bounding box of the design.
>>>
>>> Alex
>>>
>>>> On 2/20/24 19:09, Alex Deucher wrote:
>>>>> On Tue, Feb 20, 2024 at 11:46 AM Romano <romaniox@gmail.com> wrote:
>>>>>> For Windows, apps like MSI Afterburner is the one to try and what most
>>>>>> people go for. Using it in the past myself, I would be surprised if it
>>>>>> adhered to such a high min power cap. But even if it did, why would we
>>>>>> have to.
>>>>>>
>>>>>> Relying on vendors cap in this case has already proven wrong because
>>>>>> things worked for quite some time already and people reported saving
>>>>>> significant amount of watts, in my case 90W(!) for <10% perf.
>>>>>>
>>>>>> Therefore this talk about safety seems rather strange to me and
>>>>>> especially so when we are talking about min_cap. Or name me a single
>>>>>> case where someone fried his card due to "too low power" set in said
>>>>>> variable. Now there was a report, where by going way too low, driver
>>>>>> goes opposite into max power. That's it. That can be easily
>>>>>> detected(vents going crazy etc.) and reverted. It is a max_cap that
>>>>>> protect HW(also above scenario), not a min_cap. Feel free to adhere to
>>>>>> safety standards with that one.
>>>>> Because operation outside of the design bounding box is undefined.  It
>>>>> might work for some boards but not others.  It's possible some of the
>>>>> logic in the firmware or some of the components used on the board may
>>>>> not work correctly below a certain limit, or the voltage regulators
>>>>> used on a specific board have a minimum requirement that would not be
>>>>> an issue if you stick the bounding box.
>>>>>
>>>>> Alex
>>>>>
>>>>>> As for solution, what some suggested already exist - a patch posted by
>>>>>> fililip on gitlab is probably the way most of you would agree. It
>>>>>> introduce a variable that can be set during boot to override min_cap.
>>>>>> But he did not pull requested it, so please, if any one of you who have
>>>>>> access to code and merge kernel would be kind enough to implement it.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2/20/24 16:46, Alex Deucher wrote:
>>>>>>> On Tue, Feb 20, 2024 at 10:42 AM Linux regression tracking (Thorsten
>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>>> On 20.02.24 16:27, Hans de Goede wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 2/20/24 16:15, Alex Deucher wrote:
>>>>>>>>>> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
>>>>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>>>>>> On 20.02.24 15:45, Alex Deucher wrote:
>>>>>>>>>>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
>>>>>>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>>>>>>>> On 17.02.24 14:30, Greg KH wrote:
>>>>>>>>>>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
>>>>>>>>>>>>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
>>>>>>>>>>>>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
>>>>>>>>>>>>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
>>>>>>>>>>>>>>> even for root. This is not of above apps issue but of the kernel, I read
>>>>>>>>>>>>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
>>>>>>>>>>>>>>> kernel and my 115W(under power)cap work again as before.
>>>>>>>>>>>>> For the record and everyone that lands here: the cause is known now
>>>>>>>>>>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
>>>>>>>>>>>>> value") [v6.7-rc1]) and the issue afaics tracked here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>>>>>>>>>
>>>>>>>>>>>>> Other mentions:
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
>>>>>>>>>>>>>
>>>>>>>>>>>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
>>>>>>>>>>>>> this there (but might have missed something!). From what I can see I
>>>>>>>>>>>>> assume this will likely be somewhat tricky to handle, as a revert
>>>>>>>>>>>>> overall might be a bad idea here. We'll see I guess.
>>>>>>>>>>>> The change aligns the driver what has been validated on each board
>>>>>>>>>>>> design.  Windows uses the same limits.  Using values lower than the
>>>>>>>>>>>> validated range can lead to undefined behavior and could potentially
>>>>>>>>>>>> damage your hardware.
>>>>>>>>>>> Thx for the reply! Yeah, I was expecting something along those lines.
>>>>>>>>>>>
>>>>>>>>>>> Nevertheless it afaics still is a regression in the eyes of many users.
>>>>>>>>>>> I'm not sure how Linus feels about this, but I wonder if we can find
>>>>>>>>>>> some solution here so that users that really want to, can continue to do
>>>>>>>>>>> what was possible out-of-the box before. Is that possible to realize or
>>>>>>>>>>> even supported already?
>>>>>>>>>>>
>>>>>>>>>>> And sure, those users would be running their hardware outside of its
>>>>>>>>>>> specifications. But is that different from overclocking (which the
>>>>>>>>>>> driver allows, doesn't it? If not by all means please correct me!)?
>>>>>>>>>> Sure.  The driver has always had upper bound limits for overclocking,
>>>>>>>>>> this change adds lower bounds checking for underclocking as well.
>>>>>>>>>> When the silicon validation teams set the bounding box for a device,
>>>>>>>>>> they set a range of values where it's reasonable to operate based on
>>>>>>>>>> the characteristics of the design.
>>>>>>>>>>
>>>>>>>>>> If we did want to allow extended underclocking, we need a big warning
>>>>>>>>>> in the logs at the very least.
>>>>>>>>> Requiring a module-option to be set to allow this, as well as a big
>>>>>>>>> warning in the logs sounds like a good solution to me.
>>>>>>>> Yeah, especially as it sounds from some of the reports as if some
>>>>>>>> vendors did a really bad job when it came to setting the proper
>>>>>>>> lower-bound limits are now adhered -- and thus higher then what we used
>>>>>>>> out-of-the box before 1958946858a62b was applied.
>>>>>>>>
>>>>>>>> Side note: I assume those "lower bounds checking" is done round about
>>>>>>>> the same way by the Windows driver? Does that one allow users to go
>>>>>>>> lower somehow? Say after modifying the registry or something like that?
>>>>>>>> Or through external tools?
>>>>>>> Windows uses the same limit.  I'm not aware of any way to override the
>>>>>>> limit on windows off hand.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>
>>>>>>>> Ciao, Thorsten
>>>>>>>>
>>>>>>>>>>>>> Roman posted something that apparently was meant to go to the list, so
>>>>>>>>>>>>> let me put it here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> """
>>>>>>>>>>>>> UPDATE: User fililip already posted patch, but it need to be merged,
>>>>>>>>>>>>> discussion is on gitlab link below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (PS: I hope I am replying correctly to "all" now? - using original addr.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> it seems that commit was already found(see user's 'fililip' comment):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>>>>>>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39
>>>>>>>>>>>>>> Author: Ma Jun <Jun.Ma2@amd.com>
>>>>>>>>>>>>>> Date:   Thu Oct 12 09:33:45 2023 +0800
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        drm/amd/pm: Support for getting power1_cap_min value
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        Support for getting power1_cap_min value on smu13 and smu11.
>>>>>>>>>>>>>>        For other Asics, we still use 0 as the default value.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
>>>>>>>>>>>>>>        Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
>>>>>>>>>>>>>>        Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, this is not good as it remove under-powering range too far. I
>>>>>>>>>>>>> was getting only about 7% less performance but 90W(!) less consumption
>>>>>>>>>>>>> when set to my 115W before. Also I wonder if we as a OS of options and
>>>>>>>>>>>>> freedom have to stick to such very high reference for min values without
>>>>>>>>>>>>> ability to override them through some sys ctrls. Commit was done by amd
>>>>>>>>>>>>> guy and I wonder if because of maybe this post that I made few months
>>>>>>>>>>>>> ago(business strategy?):
>>>>>>>>>>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
>>>>>>>>>>>>>> This is not a dangerous OC upwards where I can understand desire to
>>>>>>>>>>>>> protect HW, it is downward, having min cap at 190W when card pull on
>>>>>>>>>>>>> 115W almost same speed is IMO crazy to deny. We don't talk about default
>>>>>>>>>>>>> or reference values here either, just a move to lower the range of
>>>>>>>>>>>>> options for whatever reason.
>>>>>>>>>>>>>> I don't know how much power you guys have over them, but please
>>>>>>>>>>>>> consider either reverting this change, or give us an option to set
>>>>>>>>>>>>> min_cap through say /sys (right now param is readonly, even for root).
>>>>>>>>>>>>>> Thank you in advance for looking into this, with regards:  Romano
>>>>>>>>>>>>> """
>>>>>>>>>>>>>
>>>>>>>>>>>>> And while at it, let me add this issue to the tracking as well
>>>>>>>>>>>>>
>>>>>>>>>>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>>>>>>>>>>>> regressions; the text you find below is based on a few templates
>>>>>>>>>>>>> paragraphs you might have encountered already in similar form.
>>>>>>>>>>>>> See link in footer if these mails annoy you.]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the report. To be sure the issue doesn't fall through the
>>>>>>>>>>>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>>>>>>>>>>>>> tracking bot:
>>>>>>>>>>>>>
>>>>>>>>>>>>> #regzbot introduced 1958946858a62b /
>>>>>>>>>>>>> #regzbot title drm: amdgpu: under-powering broke
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>>>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>>>>>>>> That page also explains what to do if mails like this annoy you.
>>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-21 15:39                           ` Alex Deucher
  2024-02-21 15:53                             ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-21 17:45                             ` Romano
@ 2024-02-26 13:04                             ` Daniel Vetter
  2 siblings, 0 replies; 28+ messages in thread
From: Daniel Vetter @ 2024-02-26 13:04 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Linux regressions mailing list, Romano, Hans de Goede,
	Alex Deucher, Christian König, Pan, Xinhui, Ma Jun, amd-gfx,
	Dave Airlie, Greg KH

Back from vacations ...

On Wed, 21 Feb 2024 at 16:39, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Feb 21, 2024 at 1:06 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
> >
> > On 20.02.24 21:18, Alex Deucher wrote:
> > > On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@gmail.com> wrote:
> > >>
> > >> If the increased low range is allowed via boot option, like in proposed
> > >> patch, user clearly made an intentional decision. Undefined, but won't
> > >> fry his hardware for sure. Undefined is also overclocking in that
> > >> matter. You can go out of range with ratio of voltage vs frequency(still
> > >> within vendor's limits) for example and crash the system.
> > >
> > > This whole thing reminds me of this:
> > > https://xkcd.com/1172/
> > > The problem is another module parameter is another interface to
> > > maintain and validate.
> >
> > Yup, of course, all that is understood.
> >
> > But we have this "no regressions" rule for a reason. Adhering to it
> > strictly would afaics be counter-productive in this situation, but give
> > users some way to manually do what was possible before out-of-the box
> > IMHO is the minimum we should do.
> >
> > Maybe just allow that parameter only up to a certain recent GPU
> > generation, that way you won't have to deal with that at some point in
> > the future.
>
> The problem is the cumulative effect of all of these parameters.
> Every time there is some change in the driver someone disagrees with
> there is a push to add a module parameter for it.  The driver already
> has too many module parameters and it's hard to keep them all working
> consistently and in every possible combination.  Moreover, the module
> options are supposed to be mainly for debugging.  The driver sets
> proper defaults for all chips to ensure proper operation, however lots
> of random forums seem to treat them like they are the recipe for some
> special sauce so users are constantly setting various combinations of
> them because they read somewhere on a forum that it would make their
> GPU run faster.  More often than not this leads to problems.
>
> Even if we did make the option only valid for these specific chips,
> there will be an expectation that future chips will support it as
> well, because someone will hack the driver and test it and it may work
> for them and then there will be a push to add it for those chips too.

Chiming in here ...

tldr; yes

GPU drivers are ridiculously hard to get right, and combinatorial
explosion is a real issue and concern; this is not someone hiding behind
corporate rules. DRM folks added module_param*unsafe to discourage
users from playing around with options we need for debugging, for very,
very real reasons. We have aggressively removed tuning knobs in the
past, and those we still have in various drivers are causing endless
amounts of pain.
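
For illustration only, and not part of any patch in this thread: as
far as I know, setting an option declared with one of the
module_param*unsafe macros taints the kernel with the "user" flag, so
a bug triager can spot it in a report. A minimal Python sketch of that
check follows. It assumes TAINT_USER is bit 6 of the mask exposed in
/proc/sys/kernel/tainted (per the kernel's tainted-kernels
documentation); is_user_tainted() and read_taint_mask() are
hypothetical helper names.

```python
# Sketch: decode a kernel taint mask and test the TAINT_USER bit.
# Assumption: TAINT_USER is bit 6 ('U'), as listed in the kernel's
# tainted-kernels documentation. Helper names are hypothetical.
from pathlib import Path

TAINT_USER_BIT = 6  # 'U': taint requested by userspace, e.g. an unsafe option


def is_user_tainted(mask: int) -> bool:
    """Return True if the TAINT_USER bit is set in the given taint mask."""
    return bool(mask & (1 << TAINT_USER_BIT))


def read_taint_mask(path: str = "/proc/sys/kernel/tainted") -> int:
    """Read the current taint mask; the file holds one decimal integer."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    mask = read_taint_mask()
    print(f"taint mask: {mask}, TAINT_USER set: {is_user_tainted(mask)}")
```

A set bit only says that some unsafe knob was touched since boot, not
which one, so it is an indicator for triage rather than a diagnosis.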

Also, the "no regression" rule is not ironclad, especially on
power/perf regressions, or all the security fixes would be impossible
to merge. First make it correct (even if the bug has gone unnoticed
forever), then make it fast/power efficient/pretty/whatever people
fancy. Yes, there are some exceptions, like "my desktop is crawling like
a slide-show and absolutely unusable" kind of regressions, but my
understanding is that this isn't the case here.

So unless Dave or Linus are screaming and overruling Alex here, "do
nothing" is my take here too.

Cheers, Sima

>
> Alex
>
> > >  Moreover, we've had a number of cases in the
> > > past where users have under or overclocked and reported bugs or
> > > stability issues and it did not come to light that they were doing
> > > that until we'd already spent a good deal of time trying to debug the
> > > issue.
> >
> > Taint the kernel when that module parameter is used? We iirc have a
> > taint bit exactly for this sort of situation. Sure, such reports will
> > still happen, but then you at least have an indicator to spot them.
> >
> > Ciao, Thorsten
> >
> > >  This obviously can still happen if you allow any sort of over
> > > or underclocking, but at least if you stick to the limits you are
> > > staying within the bounding box of the design.
> > >
> > > Alex
> > >
> > >> On 2/20/24 19:09, Alex Deucher wrote:
> > >>> On Tue, Feb 20, 2024 at 11:46 AM Romano <romaniox@gmail.com> wrote:
> > >>>> For Windows, apps like MSI Afterburner is the one to try and what most
> > >>>> people go for. Using it in the past myself, I would be surprised if it
> > >>>> adhered to such a high min power cap. But even if it did, why would we
> > >>>> have to.
> > >>>>
> > >>>> Relying on vendors cap in this case has already proven wrong because
> > >>>> things worked for quite some time already and people reported saving
> > >>>> significant amount of watts, in my case 90W(!) for <10% perf.
> > >>>>
> > >>>> Therefore this talk about safety seems rather strange to me and
> > >>>> especially so when we are talking about min_cap. Or name me a single
> > >>>> case where someone fried his card due to "too low power" set in said
> > >>>> variable. Now there was a report, where by going way too low, driver
> > >>>> goes opposite into max power. That's it. That can be easily
> > >>>> detected(vents going crazy etc.) and reverted. It is a max_cap that
> > >>>> protect HW(also above scenario), not a min_cap. Feel free to adhere to
> > >>>> safety standards with that one.
> > >>> Because operation outside of the design bounding box is undefined.  It
> > >>> might work for some boards but not others.  It's possible some of the
> > >>> logic in the firmware or some of the components used on the board may
> > >>> not work correctly below a certain limit, or the voltage regulators
> > >>> used on a specific board have a minimum requirement that would not be
> > >>> an issue if you stick the bounding box.
> > >>>
> > >>> Alex
> > >>>
> > >>>> As for solution, what some suggested already exist - a patch posted by
> > >>>> fililip on gitlab is probably the way most of you would agree. It
> > >>>> introduce a variable that can be set during boot to override min_cap.
> > >>>> But he did not pull requested it, so please, if any one of you who have
> > >>>> access to code and merge kernel would be kind enough to implement it.
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 2/20/24 16:46, Alex Deucher wrote:
> > >>>>> On Tue, Feb 20, 2024 at 10:42 AM Linux regression tracking (Thorsten
> > >>>>> Leemhuis) <regressions@leemhuis.info> wrote:
> > >>>>>>
> > >>>>>> On 20.02.24 16:27, Hans de Goede wrote:
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> On 2/20/24 16:15, Alex Deucher wrote:
> > >>>>>>>> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten
> > >>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
> > >>>>>>>>> On 20.02.24 15:45, Alex Deucher wrote:
> > >>>>>>>>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten
> > >>>>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
> > >>>>>>>>>>> On 17.02.24 14:30, Greg KH wrote:
> > >>>>>>>>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
> > >>>>>>>>>>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT,
> > >>>>>>>>>>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W),
> > >>>>>>>>>>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only
> > >>>>>>>>>>>>> even for root. This is not of above apps issue but of the kernel, I read
> > >>>>>>>>>>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10
> > >>>>>>>>>>>>> kernel and my 115W(under power)cap work again as before.
> > >>>>>>>>>>> For the record and everyone that lands here: the cause is known now
> > >>>>>>>>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min
> > >>>>>>>>>>> value") [v6.7-rc1]) and the issue afaics tracked here:
> > >>>>>>>>>>>
> > >>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
> > >>>>>>>>>>>
> > >>>>>>>>>>> Other mentions:
> > >>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
> > >>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
> > >>>>>>>>>>>
> > >>>>>>>>>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on
> > >>>>>>>>>>> this there (but might have missed something!). From what I can see I
> > >>>>>>>>>>> assume this will likely be somewhat tricky to handle, as a revert
> > >>>>>>>>>>> overall might be a bad idea here. We'll see I guess.
> > >>>>>>>>>> The change aligns the driver what has been validated on each board
> > >>>>>>>>>> design.  Windows uses the same limits.  Using values lower than the
> > >>>>>>>>>> validated range can lead to undefined behavior and could potentially
> > >>>>>>>>>> damage your hardware.
> > >>>>>>>>> Thx for the reply! Yeah, I was expecting something along those lines.
> > >>>>>>>>>
> > >>>>>>>>> Nevertheless it afaics still is a regression in the eyes of many users.
> > >>>>>>>>> I'm not sure how Linus feels about this, but I wonder if we can find
> > >>>>>>>>> some solution here so that users that really want to, can continue to do
> > >>>>>>>>> what was possible out-of-the box before. Is that possible to realize or
> > >>>>>>>>> even supported already?
> > >>>>>>>>>
> > >>>>>>>>> And sure, those users would be running their hardware outside of its
> > >>>>>>>>> specifications. But is that different from overclocking (which the
> > >>>>>>>>> driver allows, doesn't it? If not by all means please correct me!)?
> > >>>>>>>> Sure.  The driver has always had upper bound limits for overclocking,
> > >>>>>>>> this change adds lower bounds checking for underclocking as well.
> > >>>>>>>> When the silicon validation teams set the bounding box for a device,
> > >>>>>>>> they set a range of values where it's reasonable to operate based on
> > >>>>>>>> the characteristics of the design.
> > >>>>>>>>
> > >>>>>>>> If we did want to allow extended underclocking, we need a big warning
> > >>>>>>>> in the logs at the very least.
> > >>>>>>> Requiring a module-option to be set to allow this, as well as a big
> > >>>>>>> warning in the logs sounds like a good solution to me.
> > >>>>>> Yeah, especially as it sounds from some of the reports as if some
> > >>>>>> vendors did a really bad job when it came to setting the proper
> > >>>>>> lower-bound limits are now adhered -- and thus higher then what we used
> > >>>>>> out-of-the box before 1958946858a62b was applied.
> > >>>>>>
> > >>>>>> Side note: I assume those "lower bounds checking" is done round about
> > >>>>>> the same way by the Windows driver? Does that one allow users to go
> > >>>>>> lower somehow? Say after modifying the registry or something like that?
> > >>>>>> Or through external tools?
> > >>>>> Windows uses the same limit.  I'm not aware of any way to override the
> > >>>>> limit on windows off hand.
> > >>>>>
> > >>>>> Alex
> > >>>>>
> > >>>>>
> > >>>>>> Ciao, Thorsten
> > >>>>>>
> > >>>>>>>>>>> Roman posted something that apparently was meant to go to the list, so
> > >>>>>>>>>>> let me put it here:
> > >>>>>>>>>>>
> > >>>>>>>>>>> """
> > >>>>>>>>>>> UPDATE: User fililip already posted patch, but it need to be merged,
> > >>>>>>>>>>> discussion is on gitlab link below.
> > >>>>>>>>>>>
> > >>>>>>>>>>> (PS: I hope I am replying correctly to "all" now? - using original addr.)
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>> it seems that commit was already found(see user's 'fililip' comment):
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
> > >>>>>>>>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39
> > >>>>>>>>>>>> Author: Ma Jun <Jun.Ma2@amd.com>
> > >>>>>>>>>>>> Date:   Thu Oct 12 09:33:45 2023 +0800
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>       drm/amd/pm: Support for getting power1_cap_min value
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>       Support for getting power1_cap_min value on smu13 and smu11.
> > >>>>>>>>>>>>       For other Asics, we still use 0 as the default value.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>       Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
> > >>>>>>>>>>>>       Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
> > >>>>>>>>>>>>       Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> However, this is not good as it remove under-powering range too far. I
> > >>>>>>>>>>> was getting only about 7% less performance but 90W(!) less consumption
> > >>>>>>>>>>> when set to my 115W before. Also I wonder if we as a OS of options and
> > >>>>>>>>>>> freedom have to stick to such very high reference for min values without
> > >>>>>>>>>>> ability to override them through some sys ctrls. Commit was done by amd
> > >>>>>>>>>>> guy and I wonder if because of maybe this post that I made few months
> > >>>>>>>>>>> ago(business strategy?):
> > >>>>>>>>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
> > >>>>>>>>>>>> This is not a dangerous OC upwards where I can understand desire to
> > >>>>>>>>>>> protect HW, it is downward, having min cap at 190W when card pull on
> > >>>>>>>>>>> 115W almost same speed is IMO crazy to deny. We don't talk about default
> > >>>>>>>>>>> or reference values here either, just a move to lower the range of
> > >>>>>>>>>>> options for whatever reason.
> > >>>>>>>>>>>> I don't know how much power you guys have over them, but please
> > >>>>>>>>>>> consider either reverting this change, or give us an option to set
> > >>>>>>>>>>> min_cap through say /sys (right now param is readonly, even for root).
> > >>>>>>>>>>>> Thank you in advance for looking into this, with regards:  Romano
> > >>>>>>>>>>> """
> > >>>>>>>>>>>
> > >>>>>>>>>>> And while at it, let me add this issue to the tracking as well
> > >>>>>>>>>>>
> > >>>>>>>>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
> > >>>>>>>>>>> regressions; the text you find below is based on a few templates
> > >>>>>>>>>>> paragraphs you might have encountered already in similar form.
> > >>>>>>>>>>> See link in footer if these mails annoy you.]
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for the report. To be sure the issue doesn't fall through the
> > >>>>>>>>>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> > >>>>>>>>>>> tracking bot:
> > >>>>>>>>>>>
> > >>>>>>>>>>> #regzbot introduced 1958946858a62b /
> > >>>>>>>>>>> #regzbot title drm: amdgpu: under-powering broke
> > >>>>>>>>>>>
> > >>>>>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Everything you wanna know about Linux kernel regression tracking:
> > >>>>>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
> > >>>>>>>>>>> That page also explains what to do if mails like this annoy you.
> > >>>>>>>
> > >
> > >



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)
  2024-02-21 15:53                             ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-03-04 14:12                               ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 0 replies; 28+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-03-04 14:12 UTC (permalink / raw)
  To: Alex Deucher, Linux regressions mailing list
  Cc: Romano, Hans de Goede, Alex Deucher, Christian König, Pan,
	Xinhui, Ma Jun, amd-gfx, Dave Airlie, Daniel Vetter, Greg KH

On 21.02.24 16:53, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 21.02.24 16:39, Alex Deucher wrote:
>> On Wed, Feb 21, 2024 at 1:06 AM Linux regression tracking (Thorsten
>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>
>>> On 20.02.24 21:18, Alex Deucher wrote:
>>>> On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@gmail.com> wrote:
>>>>>
>>>>> If the increased low range is allowed via boot option, like in proposed
>>>>> patch, user clearly made an intentional decision. Undefined, but won't
>>>>> fry his hardware for sure. Undefined is also overclocking in that
>>>>> matter. You can go out of range with ratio of voltage vs frequency(still
>>>>> within vendor's limits) for example and crash the system.
>>>
>>> But we have this "no regressions" rule for a reason. Adhering to it
>>> strictly would afaics be counter-productive in this situation, but give
>>> users some way to manually do what was possible before out-of-the box
>>> IMHO is the minimum we should do.
> [...]

TWIMC, I mentioned this twice in mails to Linus; he didn't get involved,
so I assume things are fine the way they are for him. And then it's of
course totally fine for me, too. :-D

Thx again for all your help and sorry for causing trouble, but in my
line of work these "might or might not be a regression from Linus'
viewpoint, so let's get him involved" situations sometimes just happen.

Ciao, Thorsten

#regzbot resolve: apparently not a regression from Linus viewpoint

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-03-04 14:12 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-17 13:01 Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu) Roman Benes
2024-02-17 13:30 ` Greg KH
2024-02-19 11:15   ` Linux regression tracking (Thorsten Leemhuis)
2024-02-19 11:31     ` Roman Benes
2024-02-19 11:35     ` Romano
2024-02-20 14:45     ` Alex Deucher
2024-02-20 15:03       ` Linux regression tracking (Thorsten Leemhuis)
2024-02-20 15:15         ` Alex Deucher
2024-02-20 15:26           ` Christian König
2024-02-20 15:27           ` Hans de Goede
2024-02-20 15:42             ` Linux regression tracking (Thorsten Leemhuis)
2024-02-20 15:46               ` Alex Deucher
2024-02-20 16:46                 ` Romano
2024-02-20 18:09                   ` Alex Deucher
2024-02-20 19:41                     ` Romano
2024-02-20 20:18                       ` Alex Deucher
2024-02-20 21:30                         ` Romano
2024-02-21  6:06                         ` Linux regression tracking (Thorsten Leemhuis)
2024-02-21 15:15                           ` Christian König
2024-02-21 15:44                             ` Thorsten Leemhuis
2024-02-21 16:47                             ` Romano
2024-02-21 15:39                           ` Alex Deucher
2024-02-21 15:53                             ` Linux regression tracking (Thorsten Leemhuis)
2024-03-04 14:12                               ` Linux regression tracking (Thorsten Leemhuis)
2024-02-21 17:45                             ` Romano
2024-02-26 13:04                             ` Daniel Vetter
2024-02-20 18:14             ` Alex Deucher
2024-02-20 11:20 ` Linux regression tracking #adding (Thorsten Leemhuis)
