All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Lazar, Lijo" <lijo.lazar@amd.com>
To: Luben Tuikov <luben.tuikov@amd.com>,
	Alex Deucher <alexdeucher@gmail.com>
Cc: "Russell, Kent" <Kent.Russell@amd.com>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Quan, Evan" <Evan.Quan@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>
Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency
Date: Wed, 27 Oct 2021 10:50:39 +0530	[thread overview]
Message-ID: <d0ed1ed4-42e8-4c05-f027-7bbb5486909b@amd.com> (raw)
In-Reply-To: <04c721d1-0561-fef8-263e-e492499ef77b@amd.com>



On 10/27/2021 3:30 AM, Luben Tuikov wrote:
> On 2021-10-26 17:26, Alex Deucher wrote:
>> On Tue, Oct 19, 2021 at 9:54 AM Luben Tuikov <luben.tuikov@amd.com> wrote:
>>> It again fails with the same message!
>>> But this time it is different!
>>> Here's why:
>>>
>>> openat(AT_FDCWD, "/sys/class/drm/card0/device/pp_dpm_fclk", O_RDONLY) = 3
>>> read(3, "0: 571Mhz \n1: 1274Mhz *\n2: 1221M"..., 8191) = 36
>>> read(3, "", 8191)                       = 0
>>> close(3)                                = 0
>>> write(2, "python3: /home/ltuikov/proj/amd/"..., 220python3: /home/ltuikov/proj/amd/rocm_smi_lib/src/rocm_smi.cc:913: rsmi_status_t get_frequencies(amd::smi::DevInfoTypes, uint32_t, rsmi_frequencies_t*, uint32_t*): Assertion `f->frequency[i-1] <= f->frequency[i]' failed.
>>> ) = 220
>>> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f531f9bc000
>>> rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
>>> rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
>>> getpid()                                = 37861
>>> gettid()                                = 37861
>>> tgkill(37861, 37861, SIGABRT)           = 0
>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>> --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=37861, si_uid=1000} ---
>>> +++ killed by SIGABRT (core dumped) +++
>>> Aborted (core dumped)
>>> $cat /sys/class/drm/card0/device/pp_dpm_fclk
>>> 0: 571Mhz
>>> 1: 1274Mhz *
>>> 2: 1221Mhz
>>> $_
>>>
>>> Why is the mid frequency larger than the last?
>>> Why does get_frequencies() insists that they be ordered when they're not? (Does the tool need fixing or the kernel?)
>>>
>>> The current patchset doesn't report 0, and doesn't report any current if 0 would've been reported as current. But anything else is reported as it would've been reported before the patch. And I tested it with vanilla amd-staging-drm-next--same thing.
>>>
>> Seems to crash both ways.  I'd rather either:
>> 1. Remove the * when the clock is outside of the min and max ranges
>> or
>> 2. Clamp the clock to the max or min if it's above or below.
>>
>> And then fix the tools accordingly.  Those seem like the choices of
>> least surprise considering the interface is supposed to show the
>> current and available DPM levels.
> 
> So, if we get a case such as the one above, we remove the whole line at 1:, not just the asterisk, right? (for option 1 above).
> The rocm-smi lib would fail if they're out of order, so only removing the * char would still confuse the tool.
> 
> What does it mean when the current frequency (line 1:) is above the "max" (line 2:) as shown in my output above?
> 
> Do we perhaps want to sort them and report current still, and swap lines 1 and 2?
> 
> Or should we simply remove the ordering requirement in rocm-smi lib?
> 
These nodes help to find the current state of ASIC. Clamping the values 
will just erase questions like these and possible improvements/fixes.

For ex: if the situation above is a case of OD (this is only example, 
don't know what is really the case above) that goes beyond regular DPM 
min/max levels, then we could add + as improvement.

IMO, the node should reflect the current state of ASIC and masking the 
values shouldn't be done. Other cases could be 0 or FW handshake 
failures where * itself will be missing.

Thanks,
Lijo

> Regards,
> Luben
> 
>>
>> Alex
>>
>>
>>> Regards,
>>> Luben
>>>
>>>
>>> On 2021-10-19 09:25, Russell, Kent wrote:
>>>
>>> [AMD Official Use Only]
>>>
>>>
>>>
>>> It was the rocm-smi -c flag. Maybe some work was done to make it more robust, that would be nice. But the -c flag is supposed to show the current frequency for each clock type. -g would do the same, but just for SCLK.
>>>
>>>
>>>
>>> Kent
>>>
>>>
>>>
>>> From: Tuikov, Luben <Luben.Tuikov@amd.com>
>>> Sent: Tuesday, October 19, 2021 12:27 AM
>>> To: Russell, Kent <Kent.Russell@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Quan, Evan <Evan.Quan@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@amd.com>
>>> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency
>>>
>>>
>>>
>>> Kent,
>>>
>>> What is the command which fails?
>>> I can try to duplicate it here.
>>>
>>> So far, things I've tried, I cannot make rocm-smi fail. Command arguments?
>>>
>>> Regards,
>>> Luben
>>>
>>> On 2021-10-18 21:06, Russell, Kent wrote:
>>>
>>> [AMD Official Use Only]
>>>
>>>
>>>
>>> The * is required for the rocm-smi’s functionality for showing what the current clocks are. We had a bug before where the * was removed, then the SMI died fantastically. Work could be done to try to handle that type of situation, but the SMI has a “show current clocks” and uses the * to determine which one is active
>>>
>>>
>>>
>>> Kent
>>>
>>>
>>>
>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Russell, Kent
>>> Sent: Monday, October 18, 2021 9:05 PM
>>> To: Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Quan, Evan <Evan.Quan@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@amd.com>
>>> Subject: RE: [PATCH 0/5] 0 MHz is not a valid current frequency
>>>
>>>
>>>
>>> [AMD Official Use Only]
>>>
>>>
>>>
>>> +Harish, rocm-smi falls under his purview now.
>>>
>>>
>>>
>>> Kent
>>>
>>>
>>>
>>> From: Tuikov, Luben <Luben.Tuikov@amd.com>
>>> Sent: Monday, October 18, 2021 4:30 PM
>>> To: Deucher, Alexander <Alexander.Deucher@amd.com>; Quan, Evan <Evan.Quan@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org; Russell, Kent <Kent.Russell@amd.com>
>>> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency
>>>
>>>
>>>
>>> I think Kent is already seen these patches as he did comment on 1/5 patch.
>>>
>>> The v3 version of the patch, posted last week, removes the asterisk to report the lowest frequency as the current frequency, when the current frequency is 0, i.e. when the block is in low power state. Does the tool rely on the asterisk? If this information is necessary could it not use amdgpu_pm_info?
>>>
>>> Regards,
>>> Luben
>>>
>>> On 2021-10-18 16:19, Deucher, Alexander wrote:
>>>
>>> [Public]
>>>
>>>
>>>
>>> We the current behavior (0 for clock) already crashes the tool, so I don't think we can really make things worse.
>>>
>>>
>>>
>>> Alex
>>>
>>>
>>>
>>> ________________________________
>>>
>>> From: Quan, Evan <Evan.Quan@amd.com>
>>> Sent: Thursday, October 14, 2021 10:25 PM
>>> To: Lazar, Lijo <Lijo.Lazar@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Russell, Kent <Kent.Russell@amd.com>
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Subject: RE: [PATCH 0/5] 0 MHz is not a valid current frequency
>>>
>>>
>>>
>>> [AMD Official Use Only]
>>>
>>>
>>>
>>> +Kent who maintains the Rocm tool
>>>
>>>
>>>
>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Lazar, Lijo
>>> Sent: Thursday, October 14, 2021 1:07 AM
>>> To: Tuikov, Luben <Luben.Tuikov@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency
>>>
>>>
>>>
>>> [AMD Official Use Only]
>>>
>>>
>>>
>>> [AMD Official Use Only]
>>>
>>>
>>>
>>>> Or maybe just a list without default hint, i.e. no asterisk?
>>>
>>>
>>> I think this is also fine meaning we are having trouble in determining the current frequency or DPM level. Evan/Alex? Don't know if this will crash the tools.
>>>
>>>
>>>
>>> Thanks,
>>> Lijo
>>>
>>> ________________________________
>>>
>>> From: Tuikov, Luben <Luben.Tuikov@amd.com>
>>> Sent: Wednesday, October 13, 2021 9:52:09 PM
>>> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Subject: Re: [PATCH 0/5] 0 MHz is not a valid current frequency
>>>
>>>
>>>
>>> On 2021-10-13 00:14, Lazar, Lijo wrote:
>>>> On 10/13/2021 8:40 AM, Luben Tuikov wrote:
>>>>> Some ASIC support low-power functionality for the whole ASIC or just
>>>>> an IP block. When in such low-power mode, some sysfs interfaces would
>>>>> report a frequency of 0, e.g.,
>>>>>
>>>>> $cat /sys/class/drm/card0/device/pp_dpm_sclk
>>>>> 0: 500Mhz
>>>>> 1: 0Mhz *
>>>>> 2: 2200Mhz
>>>>> $_
>>>>>
>>>>> An operating frequency of 0 MHz doesn't make sense, and this interface
>>>>> is designed to report only operating clock frequencies, i.e. non-zero,
>>>>> and possibly the current one.
>>>>>
>>>>> When in this low-power state, round to the smallest
>>>>> operating frequency, for this interface, as follows,
>>>>>
>>>> Would rather avoid this -
>>>>
>>>> 1) It is manipulating FW reported value. If at all there is an uncaught
>>>> issue in FW reporting of frequency values, that is masked here.
>>>> 2) Otherwise, if 0MHz is described as GFX power gated case, this
>>>> provides a convenient interface to check if GFX is power gated.
>>>>
>>>> If seeing a '0' is not pleasing, consider changing to something like
>>>>         "NA" - not available (frequency cannot be fetched at the moment).
>>> There's a ROCm tool which literally asserts if the values are not ordered in increasing order. Now since 0 < 550, but 0 is listed as the second entry, the tool simply asserts and crashes.
>>>
>>> It is not clear what you'd rather see here:
>>>
>>> $cat /sys/class/drm/card0/device/pp_dpm_sclk
>>> 0: 550Mhz
>>> 1: N/A *
>>> 2: 2200MHz
>>> $_
>>>
>>> Is this what you want to see? (That'll crash other tools which expect %uMhz.)
>>>
>>> Or maybe just a list without default hint, i.e. no asterisk?
>>>
>>> $cat /sys/class/drm/card0/device/pp_dpm_sclk
>>> 0: 550Mhz
>>> 1: 2200MHz
>>> $_
>>>
>>> What should the output be?
>>>
>>> We want to avoid showing 0, but still show numbers.
>>>
>>> Regards,
>>> Luben
>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> $cat /sys/class/drm/card0/device/pp_dpm_sclk
>>>>> 0: 500Mhz *
>>>>> 1: 2200Mhz
>>>>> $_
>>>>>
>>>>> Luben Tuikov (5):
>>>>>     drm/amd/pm: Slight function rename
>>>>>     drm/amd/pm: Rename cur_value to curr_value
>>>>>     drm/amd/pm: Rename freq_values --> freq_value
>>>>>     dpm/amd/pm: Sienna: 0 MHz is not a current clock frequency
>>>>>     dpm/amd/pm: Navi10: 0 MHz is not a current clock frequency
>>>>>
>>>>>    .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 60 +++++++++------
>>>>>    .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 73 ++++++++++++-------
>>>>>    2 files changed, 86 insertions(+), 47 deletions(-)
>>>>>
>>>
>>>
>>>
>>>
>>>
> 

  reply	other threads:[~2021-10-27  5:21 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-13  3:10 [PATCH 0/5] 0 MHz is not a valid current frequency Luben Tuikov
2021-10-13  3:10 ` [PATCH 1/5] drm/amd/pm: Slight function rename Luben Tuikov
2021-10-13 14:50   ` Russell, Kent
2021-10-13 14:53     ` Luben Tuikov
2021-10-18 22:52   ` Paul Menzel
2021-10-18 23:02     ` Luben Tuikov
2021-10-13  3:10 ` [PATCH 2/5] drm/amd/pm: Rename cur_value to curr_value Luben Tuikov
2021-10-13  3:10 ` [PATCH 3/5] drm/amd/pm: Rename freq_values --> freq_value Luben Tuikov
2021-10-13  3:10 ` [PATCH 4/5] dpm/amd/pm: Sienna: 0 MHz is not a current clock frequency Luben Tuikov
2021-10-13  3:10 ` [PATCH 5/5] dpm/amd/pm: Navi10: " Luben Tuikov
2021-10-13  4:14 ` [PATCH 0/5] 0 MHz is not a valid current frequency Lazar, Lijo
2021-10-13  7:06   ` Quan, Evan
2021-10-13 13:58   ` Alex Deucher
2021-10-13 14:17     ` Lazar, Lijo
2021-10-13 16:22   ` Luben Tuikov
2021-10-13 17:06     ` Lazar, Lijo
2021-10-13 17:19       ` Alex Deucher
2021-10-15  2:25       ` Quan, Evan
2021-10-18 20:19         ` Deucher, Alexander
2021-10-18 20:29           ` Luben Tuikov
2021-10-19  1:04             ` Russell, Kent
2021-10-19  1:06               ` Russell, Kent
2021-10-19  4:27                 ` Luben Tuikov
2021-10-19 13:25                   ` Russell, Kent
2021-10-19 13:54                     ` Luben Tuikov
2021-10-26 21:26                       ` Alex Deucher
2021-10-26 22:00                         ` Luben Tuikov
2021-10-27  5:20                           ` Lazar, Lijo [this message]
2021-10-27 15:12                             ` Alex Deucher

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d0ed1ed4-42e8-4c05-f027-7bbb5486909b@amd.com \
    --to=lijo.lazar@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Evan.Quan@amd.com \
    --cc=Harish.Kasiviswanathan@amd.com \
    --cc=Kent.Russell@amd.com \
    --cc=alexdeucher@gmail.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=luben.tuikov@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.