Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register

From: Luben Tuikov <luben.tuikov@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
	"Alex Jivin" <alex.jivin@amd.com>,
	amd-gfx@lists.freedesktop.org
Cc: Alex Deucher <alexander.deucher@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register
Date: Mon, 6 Jul 2020 08:42:12 -0400	[thread overview]
Message-ID: <c3a95ef4-985e-b919-cf66-729a1218f096@amd.com> (raw)
In-Reply-To: <129c10a5-a531-96a2-83ac-d3bd760c426d@amd.com>

On 2020-06-30 3:01 a.m., Christian König wrote:
> Am 30.06.20 um 00:46 schrieb Luben Tuikov:
>> On 2020-06-26 1:04 p.m., Christian König wrote:
>>> Am 26.06.20 um 18:12 schrieb Alex Jivin:
>>>> Adding a delay between writing to UVD control register and reading from it.
>>>> This is to allow the HW to process the write command.
>>>>
>>>> Signed-off-by: Alex Jivin <alex.jivin@amd.com>
>>>> Suggested-By: Luben Tukov <luben.tuikov@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/si.c | 6 ++++++
>>>>    1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
>>>> index 9d7b4ccd17b8..42cdc14fb79d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/si.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/si.c
>>>> @@ -1435,6 +1435,12 @@ static int si_uvd_send_upll_ctlreq(struct amdgpu_device *adev,
>>>>    	/* Assert UPLL_CTLREQ */
>>>>    	WREG32_P(cg_upll_func_cntl, UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK);
>>>>    
>>>> +	/* Waiting for HW to process the previous write.
>>>> +	 * This is to give a chance to HW to act before
>>>> +	 * the first read is done.
>>>> +	 */
>>>> +	mdelay(1);
>>>> +
>>> Mhm, that is most likely not a good idea.
>>>
>>> We need to issue a read after the write to make sure that the stuff is
>>> send out to the hardware.

^^^ This is (A).

>> Tracing down WREG32_P(), it seems to be writing and then reading the register,
> 
> Why do you think so?

soc15_pcie_wreg() shows it. But as I said, this may have been  broken by
this 2eee0229f65e897134566888e5321bcb3af0df7a
patch and there's a separate thread on this.

> 
> What the macro should do is to read, apply mask and value and then write 
> it back to the register:
> 
> #define WREG32_P(reg, val, mask)                \
>      do {                            \
>          uint32_t tmp_ = RREG32(reg);            \
>          tmp_ &= (mask);                    \
>          tmp_ |= ((val) & ~(mask));            \
>          WREG32(reg, tmp_);                \
>      } while (0)
> 
>> back to back, twice over, when deferring to PCIe space, and just writel() when in mmio
>> space. (There is separate thread on this as it doesn't seem correct.)
> 
> That indeed sounds fishy, but all those registers should be in mmio space.
> 
>>
>>> Adding a delay here is probably just postponing that. Do we have some
>>> note in the documentation that this is necessary?
>> Flushing the write buffer (by issuing a read if necessary) is different
>> than waiting for the hardware to process the request.
> 
> It's not flushing the write buffer. The hardware is often build in a way 
> that the next read is triggering the action instead of the write.

I don't think that the "read is triggering the action", in this case.
The write is triggering the action, when it makes it to the hardware.

Also, often enough, the coherency unit is built with dependency detection.
This means that a read following a write to the same "hardware" address,
would cause the read op to trigger the flushing of the write buffer
if the address is still in the write buffer. And then the read
would be issued, following the flush, usually by the bus unit (BU).
(So the bus would show a bunch of WRITES followed by our READ.)
You seem to agree as you said that in (A) above.

> That's done because writes should be accepted by the hardware block 
> immediately while reads have a timeout associated with it.

That's is true but irrelevant here.

As the code is present right now in the kernel driver,
if the hardware changes state on the last delay of 10 ms,
you're going to miss that as no checks are done after the delay.
The code clearly shows this.

All I'm saying is that to avoid this "race" condition,
the code should be modified, as there is no point in delaying 10
ms only to quit after that, when count gets to equal MAX_COUNT.

>> The current code does flush the write buffer by reading back,
>> and then delaying (both in the loop), which does achieve this,
>> but it leaves a corner case as I wrote in my review.
>> The corner case is that if the status
>> register changes to "done" while in the last "delay()"
>> we then check the loop invariant and exit, as opposed to reading
>> the register and determining that it is done successfully.
>>
>> Current, all in pseudo-code:
>>
>> write()
>>
>> for (count = 0; count < MAX_COUNT; count++) {
>> 	res = read()
>> 	if (res is success) break
>> 	mdelay(10)              <-- if it changes here, we miss it on the last iteration
>> }
>>
>> Optimal:
>>
>> write()   ; assume write buffer flush
>> mdelay(9)
>> do {
>> 	mdelay(1)
>> 	res = read()
>> } while (res != success && ++count < MAX_COUNT)
>>
>> This solves the corner case, and ensures a time delay of 10 for
>> the hardware to process its job, but a delay of 1 for polling
>> the status, as it could be done "anytime now."
> 
> That seems to assume that the ACK bits are not immediately cleared on 
> the write. How do you got to that conclusion?

No such assumption was taken.
This is about the wasted 10 ms delay
as pointed to by the current code quoted above; the comment
on the line with an arrow in it.

> 
> Processing the request can take even longer than 10ms depending on what 
> is done, which reference clock is selected, the temperature of the 
> hardware, sleep mode etc etc...

That's why the loop.

> 
[snip!]
> 
>> Assuming that WREG32_P() flushes the write buffer, as it seems
>> that it does, the idea here is to give the hardware some time to process
>> the request (from writing a value to it), but when polling to poll
>> a shorter amount of time.
> 
> Not a bad idea, but this is based on the diagnostic code and already 
> used for nearly a decade.

Yes, I understand that it does work right now. Surely,
if you set the loop limit high enough, you can show
that you'd iterate at least one over the number required
to complete or fail, and thus "used for nearly a decade".

But from a computational point of view it is not correct
in that the very last delay is a wasted delay:

	write()
	for (count = 0; count < MAX_COUNT; count++) {
		res = read()
		if (res is success) break
		mdelay(10)              <-- if it changes here, we miss it on the last iteration
	}

It's not necessary to fix this, but it is good to point out
that this is currently the case.

> In other words this code is what the hardware engineers recommended, is 
> used *much* more often outside the driver and known to be working well.

Ah, yes, I didn't want to resort to the "hardware engineers recommended",
but here is what I've been told by hardware engineers (as seen by their
interface):

1. Write your request to the hardware.
2. Wait M ms for the hardware to do its thing.
3. Read back and check status.
4. If done, then you're good.
5. If not done, wait a tiny bit of time, N, and
   then check again, until done or timeout.

Usually N << M, since M is the actual time for
hardware do do something, and N is a lot smaller
than M, since it is just back-off time as
the hardware can be done "anytime now".

As a waveform it would look like this:

Immediate success: WRITE, 10, READ.
Delayed success:   WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.
Delayed failure:   WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.

The do-while loop above accomplishes this. The loop always ends on a read.
The current code would end on a delay on a failure, like this:
Delayed failure (count == MAX_COUNT): WRITE, READ, 10, READ, 10, READ, 10.

But, as it is working right now "for a decade",
it can be left as is, as long as we realize that
the "last" delay is not necessary, when we're failing.

> 
>> I would've also liked to see the mutex business fixed
>> as well from my original patch review, as it is easy to prove
>> that it is always taken, so no need to embed it inside the if().
> 
> That is indeed a software problem which should be fixed.

I've not seen a patch fixing this yet.

Regards,
Luben

> 
> Regards,
> Christian.
> 
>>
>> Regards,
>> Luben
>>
>>> Christian.
>>>
>>>>    	/* Wait for CTLACK and CTLACK2 to get asserted */
>>>>    	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
>>>>    		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx