All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: Adding wait time before reading upll control register
@ 2020-06-26 16:12 Alex Jivin
  2020-06-26 17:04 ` Christian König
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Jivin @ 2020-06-26 16:12 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Luben Tuikov

Adding a delay between writing to UVD control register and reading from it.
This is to allow the HW to process the write command.

Signed-off-by: Alex Jivin <alex.jivin@amd.com>
Suggested-By: Luben Tukov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/si.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
index 9d7b4ccd17b8..42cdc14fb79d 100644
--- a/drivers/gpu/drm/amd/amdgpu/si.c
+++ b/drivers/gpu/drm/amd/amdgpu/si.c
@@ -1435,6 +1435,12 @@ static int si_uvd_send_upll_ctlreq(struct amdgpu_device *adev,
 	/* Assert UPLL_CTLREQ */
 	WREG32_P(cg_upll_func_cntl, UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK);
 
+	/* Waiting for HW to process the previous write.
+	 * This is to give a chance to HW to act before
+	 * the first read is done.
+	 */
+	mdelay(1);
+
 	/* Wait for CTLACK and CTLACK2 to get asserted */
 	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
 		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register
  2020-06-26 16:12 [PATCH] drm/amdgpu: Adding wait time before reading upll control register Alex Jivin
@ 2020-06-26 17:04 ` Christian König
  2020-06-29 22:46   ` Luben Tuikov
  0 siblings, 1 reply; 7+ messages in thread
From: Christian König @ 2020-06-26 17:04 UTC (permalink / raw)
  To: Alex Jivin, amd-gfx; +Cc: Alex Deucher, Luben Tuikov

Am 26.06.20 um 18:12 schrieb Alex Jivin:
> Adding a delay between writing to UVD control register and reading from it.
> This is to allow the HW to process the write command.
>
> Signed-off-by: Alex Jivin <alex.jivin@amd.com>
> Suggested-By: Luben Tukov <luben.tuikov@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/si.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
> index 9d7b4ccd17b8..42cdc14fb79d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/si.c
> +++ b/drivers/gpu/drm/amd/amdgpu/si.c
> @@ -1435,6 +1435,12 @@ static int si_uvd_send_upll_ctlreq(struct amdgpu_device *adev,
>   	/* Assert UPLL_CTLREQ */
>   	WREG32_P(cg_upll_func_cntl, UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK);
>   
> +	/* Waiting for HW to process the previous write.
> +	 * This is to give a chance to HW to act before
> +	 * the first read is done.
> +	 */
> +	mdelay(1);
> +

Mhm, that is most likely not a good idea.

We need to issue a read after the write to make sure that the stuff is 
send out to the hardware.

Adding a delay here is probably just postponing that. Do we have some 
note in the documentation that this is necessary?

Christian.

>   	/* Wait for CTLACK and CTLACK2 to get asserted */
>   	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
>   		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register
  2020-06-26 17:04 ` Christian König
@ 2020-06-29 22:46   ` Luben Tuikov
  2020-06-30  7:01     ` Christian König
  0 siblings, 1 reply; 7+ messages in thread
From: Luben Tuikov @ 2020-06-29 22:46 UTC (permalink / raw)
  To: christian.koenig, Alex Jivin, amd-gfx; +Cc: Alex Deucher

On 2020-06-26 1:04 p.m., Christian König wrote:
> Am 26.06.20 um 18:12 schrieb Alex Jivin:
>> Adding a delay between writing to UVD control register and reading from it.
>> This is to allow the HW to process the write command.
>>
>> Signed-off-by: Alex Jivin <alex.jivin@amd.com>
>> Suggested-By: Luben Tukov <luben.tuikov@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/si.c | 6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
>> index 9d7b4ccd17b8..42cdc14fb79d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/si.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/si.c
>> @@ -1435,6 +1435,12 @@ static int si_uvd_send_upll_ctlreq(struct amdgpu_device *adev,
>>   	/* Assert UPLL_CTLREQ */
>>   	WREG32_P(cg_upll_func_cntl, UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK);
>>   
>> +	/* Waiting for HW to process the previous write.
>> +	 * This is to give a chance to HW to act before
>> +	 * the first read is done.
>> +	 */
>> +	mdelay(1);
>> +
> 
> Mhm, that is most likely not a good idea.
> 
> We need to issue a read after the write to make sure that the stuff is 
> send out to the hardware.

Tracing down WREG32_P(), it seems to be writing and then reading the register,
back to back, twice over, when deferring to PCIe space, and just writel() when in mmio
space. (There is separate thread on this as it doesn't seem correct.)

> Adding a delay here is probably just postponing that. Do we have some 
> note in the documentation that this is necessary?

Flushing the write buffer (by issuing a read if necessary) is different
than waiting for the hardware to process the request.

The current code does flush the write buffer by reading back,
and then delaying (both in the loop), which does achieve this,
but it leaves a corner case as I wrote in my review.
The corner case is that if the status
register changes to "done" while in the last "delay()"
we then check the loop invariant and exit, as opposed to reading
the register and determining that it is done successfully.

Current, all in pseudo-code:

write()

for (count = 0; count < MAX_COUNT; count++) {
	res = read()
	if (res is success) break
	mdelay(10)              <-- if it changes here, we miss it on the last iteration
}

Optimal:

write()   ; assume write buffer flush
mdelay(9)
do {
	mdelay(1)
	res = read()
} while (res != success && ++count < MAX_COUNT)

This solves the corner case, and ensures a time delay of 10 for
the hardware to process its job, but a delay of 1 for polling
the status, as it could be done "anytime now."

Assuming that WREG32_P() flushes the write buffer, as it seems
that it does, the idea here is to give the hardware some time to process
the request (from writing a value to it), but when polling to poll
a shorter amount of time.

I would've also liked to see the mutex business fixed
as well from my original patch review, as it is easy to prove
that it is always taken, so no need to embed it inside the if().

Regards,
Luben

> 
> Christian.
> 
>>   	/* Wait for CTLACK and CTLACK2 to get asserted */
>>   	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
>>   		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register
  2020-06-29 22:46   ` Luben Tuikov
@ 2020-06-30  7:01     ` Christian König
  2020-07-06 12:42       ` Luben Tuikov
  0 siblings, 1 reply; 7+ messages in thread
From: Christian König @ 2020-06-30  7:01 UTC (permalink / raw)
  To: Luben Tuikov, Alex Jivin, amd-gfx; +Cc: Alex Deucher

Am 30.06.20 um 00:46 schrieb Luben Tuikov:
> On 2020-06-26 1:04 p.m., Christian König wrote:
>> Am 26.06.20 um 18:12 schrieb Alex Jivin:
>>> Adding a delay between writing to UVD control register and reading from it.
>>> This is to allow the HW to process the write command.
>>>
>>> Signed-off-by: Alex Jivin <alex.jivin@amd.com>
>>> Suggested-By: Luben Tukov <luben.tuikov@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/si.c | 6 ++++++
>>>    1 file changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
>>> index 9d7b4ccd17b8..42cdc14fb79d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/si.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/si.c
>>> @@ -1435,6 +1435,12 @@ static int si_uvd_send_upll_ctlreq(struct amdgpu_device *adev,
>>>    	/* Assert UPLL_CTLREQ */
>>>    	WREG32_P(cg_upll_func_cntl, UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK);
>>>    
>>> +	/* Waiting for HW to process the previous write.
>>> +	 * This is to give a chance to HW to act before
>>> +	 * the first read is done.
>>> +	 */
>>> +	mdelay(1);
>>> +
>> Mhm, that is most likely not a good idea.
>>
>> We need to issue a read after the write to make sure that the stuff is
>> send out to the hardware.
> Tracing down WREG32_P(), it seems to be writing and then reading the register,

Why do you think so?

What the macro should do is to read, apply mask and value and then write 
it back to the register:

#define WREG32_P(reg, val, mask)                \
     do {                            \
         uint32_t tmp_ = RREG32(reg);            \
         tmp_ &= (mask);                    \
         tmp_ |= ((val) & ~(mask));            \
         WREG32(reg, tmp_);                \
     } while (0)

> back to back, twice over, when deferring to PCIe space, and just writel() when in mmio
> space. (There is separate thread on this as it doesn't seem correct.)

That indeed sounds fishy, but all those registers should be in mmio space.

>
>> Adding a delay here is probably just postponing that. Do we have some
>> note in the documentation that this is necessary?
> Flushing the write buffer (by issuing a read if necessary) is different
> than waiting for the hardware to process the request.

It's not flushing the write buffer. The hardware is often build in a way 
that the next read is triggering the action instead of the write.

That's done because writes should be accepted by the hardware block 
immediately while reads have a timeout associated with it.

> The current code does flush the write buffer by reading back,
> and then delaying (both in the loop), which does achieve this,
> but it leaves a corner case as I wrote in my review.
> The corner case is that if the status
> register changes to "done" while in the last "delay()"
> we then check the loop invariant and exit, as opposed to reading
> the register and determining that it is done successfully.
>
> Current, all in pseudo-code:
>
> write()
>
> for (count = 0; count < MAX_COUNT; count++) {
> 	res = read()
> 	if (res is success) break
> 	mdelay(10)              <-- if it changes here, we miss it on the last iteration
> }
>
> Optimal:
>
> write()   ; assume write buffer flush
> mdelay(9)
> do {
> 	mdelay(1)
> 	res = read()
> } while (res != success && ++count < MAX_COUNT)
>
> This solves the corner case, and ensures a time delay of 10 for
> the hardware to process its job, but a delay of 1 for polling
> the status, as it could be done "anytime now."

That seems to assume that the ACK bits are not immediately cleared on 
the write. How do you got to that conclusion?


Processing the request can take even longer than 10ms depending on what 
is done, which reference clock is selected, the temperature of the 
hardware, sleep mode etc etc...

For example switching to bypass mode usually comes immediately, e.g. you 
write the CNTL register and you immediately get an ack back.

But as far as I know switching from bypass back to normal means that we 
need to wait for both the reference, feedback and output signal to be 
logical low and then high again at the same time.

Depending on the frequencies we got here that can take a bit of time.

> Assuming that WREG32_P() flushes the write buffer, as it seems
> that it does, the idea here is to give the hardware some time to process
> the request (from writing a value to it), but when polling to poll
> a shorter amount of time.

Not a bad idea, but this is based on the diagnostic code and already 
used for nearly a decade.

In other words this code is what the hardware engineers recommended, is 
used *much* more often outside the driver and known to be working well.

> I would've also liked to see the mutex business fixed
> as well from my original patch review, as it is easy to prove
> that it is always taken, so no need to embed it inside the if().

That is indeed a software problem which should be fixed.

Regards,
Christian.

>
> Regards,
> Luben
>
>> Christian.
>>
>>>    	/* Wait for CTLACK and CTLACK2 to get asserted */
>>>    	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
>>>    		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register
  2020-06-30  7:01     ` Christian König
@ 2020-07-06 12:42       ` Luben Tuikov
  2020-07-06 12:59         ` Christian König
  0 siblings, 1 reply; 7+ messages in thread
From: Luben Tuikov @ 2020-07-06 12:42 UTC (permalink / raw)
  To: Christian König, Alex Jivin, amd-gfx; +Cc: Alex Deucher

On 2020-06-30 3:01 a.m., Christian König wrote:
> Am 30.06.20 um 00:46 schrieb Luben Tuikov:
>> On 2020-06-26 1:04 p.m., Christian König wrote:
>>> Am 26.06.20 um 18:12 schrieb Alex Jivin:
>>>> Adding a delay between writing to UVD control register and reading from it.
>>>> This is to allow the HW to process the write command.
>>>>
>>>> Signed-off-by: Alex Jivin <alex.jivin@amd.com>
>>>> Suggested-By: Luben Tukov <luben.tuikov@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/si.c | 6 ++++++
>>>>    1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
>>>> index 9d7b4ccd17b8..42cdc14fb79d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/si.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/si.c
>>>> @@ -1435,6 +1435,12 @@ static int si_uvd_send_upll_ctlreq(struct amdgpu_device *adev,
>>>>    	/* Assert UPLL_CTLREQ */
>>>>    	WREG32_P(cg_upll_func_cntl, UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK);
>>>>    
>>>> +	/* Waiting for HW to process the previous write.
>>>> +	 * This is to give a chance to HW to act before
>>>> +	 * the first read is done.
>>>> +	 */
>>>> +	mdelay(1);
>>>> +
>>> Mhm, that is most likely not a good idea.
>>>
>>> We need to issue a read after the write to make sure that the stuff is
>>> send out to the hardware.

^^^ This is (A).

>> Tracing down WREG32_P(), it seems to be writing and then reading the register,
> 
> Why do you think so?

soc15_pcie_wreg() shows it. But as I said, this may have been  broken by
this 2eee0229f65e897134566888e5321bcb3af0df7a
patch and there's a separate thread on this.

> 
> What the macro should do is to read, apply mask and value and then write 
> it back to the register:
> 
> #define WREG32_P(reg, val, mask)                \
>      do {                            \
>          uint32_t tmp_ = RREG32(reg);            \
>          tmp_ &= (mask);                    \
>          tmp_ |= ((val) & ~(mask));            \
>          WREG32(reg, tmp_);                \
>      } while (0)
> 
>> back to back, twice over, when deferring to PCIe space, and just writel() when in mmio
>> space. (There is separate thread on this as it doesn't seem correct.)
> 
> That indeed sounds fishy, but all those registers should be in mmio space.
> 
>>
>>> Adding a delay here is probably just postponing that. Do we have some
>>> note in the documentation that this is necessary?
>> Flushing the write buffer (by issuing a read if necessary) is different
>> than waiting for the hardware to process the request.
> 
> It's not flushing the write buffer. The hardware is often build in a way 
> that the next read is triggering the action instead of the write.

I don't think that the "read is triggering the action", in this case.
The write is triggering the action, when it makes it to the hardware.

Also, often enough, the coherency unit is built with dependency detection.
This means that a read following a write to the same "hardware" address,
would cause the read op to trigger the flushing of the write buffer
if the address is still in the write buffer. And then the read
would be issued, following the flush, usually by the bus unit (BU).
(So the bus would show a bunch of WRITES followed by our READ.)
You seem to agree as you said that in (A) above.

> That's done because writes should be accepted by the hardware block 
> immediately while reads have a timeout associated with it.

That's is true but irrelevant here.

As the code is present right now in the kernel driver,
if the hardware changes state on the last delay of 10 ms,
you're going to miss that as no checks are done after the delay.
The code clearly shows this.

All I'm saying is that to avoid this "race" condition,
the code should be modified, as there is no point in delaying 10
ms only to quit after that, when count gets to equal MAX_COUNT.

>> The current code does flush the write buffer by reading back,
>> and then delaying (both in the loop), which does achieve this,
>> but it leaves a corner case as I wrote in my review.
>> The corner case is that if the status
>> register changes to "done" while in the last "delay()"
>> we then check the loop invariant and exit, as opposed to reading
>> the register and determining that it is done successfully.
>>
>> Current, all in pseudo-code:
>>
>> write()
>>
>> for (count = 0; count < MAX_COUNT; count++) {
>> 	res = read()
>> 	if (res is success) break
>> 	mdelay(10)              <-- if it changes here, we miss it on the last iteration
>> }
>>
>> Optimal:
>>
>> write()   ; assume write buffer flush
>> mdelay(9)
>> do {
>> 	mdelay(1)
>> 	res = read()
>> } while (res != success && ++count < MAX_COUNT)
>>
>> This solves the corner case, and ensures a time delay of 10 for
>> the hardware to process its job, but a delay of 1 for polling
>> the status, as it could be done "anytime now."
> 
> That seems to assume that the ACK bits are not immediately cleared on 
> the write. How do you got to that conclusion?

No such assumption was taken.
This is about the wasted 10 ms delay
as pointed to by the current code quoted above; the comment
on the line with an arrow in it.

> 
> Processing the request can take even longer than 10ms depending on what 
> is done, which reference clock is selected, the temperature of the 
> hardware, sleep mode etc etc...

That's why the loop.

> 
[snip!]
> 
>> Assuming that WREG32_P() flushes the write buffer, as it seems
>> that it does, the idea here is to give the hardware some time to process
>> the request (from writing a value to it), but when polling to poll
>> a shorter amount of time.
> 
> Not a bad idea, but this is based on the diagnostic code and already 
> used for nearly a decade.

Yes, I understand that it does work right now. Surely,
if you set the loop limit high enough, you can show
that you'd iterate at least one over the number required
to complete or fail, and thus "used for nearly a decade".

But from a computational point of view it is not correct
in that the very last delay is a wasted delay:

	write()
	for (count = 0; count < MAX_COUNT; count++) {
		res = read()
		if (res is success) break
		mdelay(10)              <-- if it changes here, we miss it on the last iteration
	}

It's not necessary to fix this, but it is good to point out
that this is currently the case.

> In other words this code is what the hardware engineers recommended, is 
> used *much* more often outside the driver and known to be working well.

Ah, yes, I didn't want to resort to the "hardware engineers recommended",
but here is what I've been told by hardware engineers (as seen by their
interface):

1. Write your request to the hardware.
2. Wait M ms for the hardware to do its thing.
3. Read back and check status.
4. If done, then you're good.
5. If not done, wait a tiny bit of time, N, and
   then check again, until done or timeout.

Usually N << M, since M is the actual time for
hardware do do something, and N is a lot smaller
than M, since it is just back-off time as
the hardware can be done "anytime now".

As a waveform it would look like this:

Immediate success: WRITE, 10, READ.
Delayed success:   WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.
Delayed failure:   WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.

The do-while loop above accomplishes this. The loop always ends on a read.
The current code would end on a delay on a failure, like this:
Delayed failure (count == MAX_COUNT): WRITE, READ, 10, READ, 10, READ, 10.

But, as it is working right now "for a decade",
it can be left as is, as long as we realize that
the "last" delay is not necessary, when we're failing.

> 
>> I would've also liked to see the mutex business fixed
>> as well from my original patch review, as it is easy to prove
>> that it is always taken, so no need to embed it inside the if().
> 
> That is indeed a software problem which should be fixed.

I've not seen a patch fixing this yet.

Regards,
Luben

> 
> Regards,
> Christian.
> 
>>
>> Regards,
>> Luben
>>
>>> Christian.
>>>
>>>>    	/* Wait for CTLACK and CTLACK2 to get asserted */
>>>>    	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
>>>>    		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register
  2020-07-06 12:42       ` Luben Tuikov
@ 2020-07-06 12:59         ` Christian König
  2020-07-06 13:53           ` Luben Tuikov
  0 siblings, 1 reply; 7+ messages in thread
From: Christian König @ 2020-07-06 12:59 UTC (permalink / raw)
  To: Luben Tuikov, Alex Jivin, amd-gfx; +Cc: Alex Deucher

Am 06.07.20 um 14:42 schrieb Luben Tuikov:
> On 2020-06-30 3:01 a.m., Christian König wrote:
>> Am 30.06.20 um 00:46 schrieb Luben Tuikov:
>>> On 2020-06-26 1:04 p.m., Christian König wrote:
>>>> Am 26.06.20 um 18:12 schrieb Alex Jivin:
>>>> [SNIP] 
>>>> Adding a delay here is probably just postponing that. Do we have some
>>>> note in the documentation that this is necessary?
>>> Flushing the write buffer (by issuing a read if necessary) is different
>>> than waiting for the hardware to process the request.
>> It's not flushing the write buffer. The hardware is often build in a way
>> that the next read is triggering the action instead of the write.
> I don't think that the "read is triggering the action", in this case.
> The write is triggering the action, when it makes it to the hardware.
>
> Also, often enough, the coherency unit is built with dependency detection.
> This means that a read following a write to the same "hardware" address,
> would cause the read op to trigger the flushing of the write buffer
> if the address is still in the write buffer. And then the read
> would be issued, following the flush, usually by the bus unit (BU).
> (So the bus would show a bunch of WRITES followed by our READ.)
> You seem to agree as you said that in (A) above.

Yeah, but this is not what I meant here.

See a bit below for more description on what our hardware often does.

>> That's done because writes should be accepted by the hardware block
>> immediately while reads have a timeout associated with it.
> That's is true but irrelevant here.

No it isn't. See the read of the register acts as a delay as well.

The read doesn't comes back to the CPU immediately when there is an 
ongoing action to wait for.

Some of the SDMA, TC and HDP registers even go as far as delaying the 
return of the read until some memory is flushed.

The maximum timeout here is rather large, something in the range of 65ms 
IIRC.

> As the code is present right now in the kernel driver,
> if the hardware changes state on the last delay of 10 ms,
> you're going to miss that as no checks are done after the delay.
> The code clearly shows this.
>
> All I'm saying is that to avoid this "race" condition,
> the code should be modified, as there is no point in delaying 10
> ms only to quit after that, when count gets to equal MAX_COUNT.

Well, that is indeed a good point!

>>> The current code does flush the write buffer by reading back,
>>> and then delaying (both in the loop), which does achieve this,
>>> but it leaves a corner case as I wrote in my review.
>>> The corner case is that if the status
>>> register changes to "done" while in the last "delay()"
>>> we then check the loop invariant and exit, as opposed to reading
>>> the register and determining that it is done successfully.
>>>
>>> Current, all in pseudo-code:
>>>
>>> write()
>>>
>>> for (count = 0; count < MAX_COUNT; count++) {
>>> 	res = read()
>>> 	if (res is success) break
>>> 	mdelay(10)              <-- if it changes here, we miss it on the last iteration
>>> }
>>>
>>> Optimal:
>>>
>>> write()   ; assume write buffer flush
>>> mdelay(9)
>>> do {
>>> 	mdelay(1)
>>> 	res = read()
>>> } while (res != success && ++count < MAX_COUNT)
>>>
>>> This solves the corner case, and ensures a time delay of 10 for
>>> the hardware to process its job, but a delay of 1 for polling
>>> the status, as it could be done "anytime now."
>> That seems to assume that the ACK bits are not immediately cleared on
>> the write. How do you got to that conclusion?
> No such assumption was taken.
> This is about the wasted 10 ms delay
> as pointed to by the current code quoted above; the comment
> on the line with an arrow in it.

Yeah, now I understand what you are trying to do here. Problem is that 
this messes up the read delay feature of the hardware with that suggestion.

In other words we should not have is a delay between the initial write 
and the first read since the hardware will take care of this for us.

Regards,
Christian.

>
>> Processing the request can take even longer than 10ms depending on what
>> is done, which reference clock is selected, the temperature of the
>> hardware, sleep mode etc etc...
> That's why the loop.
>
> [snip!]
>>> Assuming that WREG32_P() flushes the write buffer, as it seems
>>> that it does, the idea here is to give the hardware some time to process
>>> the request (from writing a value to it), but when polling to poll
>>> a shorter amount of time.
>> Not a bad idea, but this is based on the diagnostic code and already
>> used for nearly a decade.
> Yes, I understand that it does work right now. Surely,
> if you set the loop limit high enough, you can show
> that you'd iterate at least one over the number required
> to complete or fail, and thus "used for nearly a decade".
>
> But from a computational point of view it is not correct
> in that the very last delay is a wasted delay:
>
> 	write()
> 	for (count = 0; count < MAX_COUNT; count++) {
> 		res = read()
> 		if (res is success) break
> 		mdelay(10)              <-- if it changes here, we miss it on the last iteration
> 	}
>
> It's not necessary to fix this, but it is good to point out
> that this is currently the case.
>
>> In other words this code is what the hardware engineers recommended, is
>> used *much* more often outside the driver and known to be working well.
> Ah, yes, I didn't want to resort to the "hardware engineers recommended",
> but here is what I've been told by hardware engineers (as seen by their
> interface):
>
> 1. Write your request to the hardware.
> 2. Wait M ms for the hardware to do its thing.
> 3. Read back and check status.
> 4. If done, then you're good.
> 5. If not done, wait a tiny bit of time, N, and
>     then check again, until done or timeout.
>
> Usually N << M, since M is the actual time for
> hardware do do something, and N is a lot smaller
> than M, since it is just back-off time as
> the hardware can be done "anytime now".
>
> As a waveform it would look like this:
>
> Immediate success: WRITE, 10, READ.
> Delayed success:   WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.
> Delayed failure:   WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.
>
> The do-while loop above accomplishes this. The loop always ends on a read.
> The current code would end on a delay on a failure, like this:
> Delayed failure (count == MAX_COUNT): WRITE, READ, 10, READ, 10, READ, 10.
>
> But, as it is working right now "for a decade",
> it can be left as is, as long as we realize that
> the "last" delay is not necessary, when we're failing.
>
>>> I would've also liked to see the mutex business fixed
>>> as well from my original patch review, as it is easy to prove
>>> that it is always taken, so no need to embed it inside the if().
>> That is indeed a software problem which should be fixed.
> I've not seen a patch fixing this yet.
>
> Regards,
> Luben
>
>> Regards,
>> Christian.
>>
>>> Regards,
>>> Luben
>>>
>>>> Christian.
>>>>
>>>>>     	/* Wait for CTLACK and CTLACK2 to get asserted */
>>>>>     	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
>>>>>     		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register
  2020-07-06 12:59         ` Christian König
@ 2020-07-06 13:53           ` Luben Tuikov
  0 siblings, 0 replies; 7+ messages in thread
From: Luben Tuikov @ 2020-07-06 13:53 UTC (permalink / raw)
  To: Christian König, Alex Jivin, amd-gfx; +Cc: Alex Deucher

On 2020-07-06 8:59 a.m., Christian König wrote:
> Am 06.07.20 um 14:42 schrieb Luben Tuikov:
>> On 2020-06-30 3:01 a.m., Christian König wrote:
>>> Am 30.06.20 um 00:46 schrieb Luben Tuikov:
>>>> On 2020-06-26 1:04 p.m., Christian König wrote:
>>>>> Am 26.06.20 um 18:12 schrieb Alex Jivin:
>>>>> [SNIP] 
>>>>> Adding a delay here is probably just postponing that. Do we have some
>>>>> note in the documentation that this is necessary?
>>>> Flushing the write buffer (by issuing a read if necessary) is different
>>>> than waiting for the hardware to process the request.
>>> It's not flushing the write buffer. The hardware is often build in a way
>>> that the next read is triggering the action instead of the write.
>> I don't think that the "read is triggering the action", in this case.
>> The write is triggering the action, when it makes it to the hardware.
>>
>> Also, often enough, the coherency unit is built with dependency detection.
>> This means that a read following a write to the same "hardware" address,
>> would cause the read op to trigger the flushing of the write buffer
>> if the address is still in the write buffer. And then the read
>> would be issued, following the flush, usually by the bus unit (BU).
>> (So the bus would show a bunch of WRITES followed by our READ.)
>> You seem to agree as you said that in (A) above.
> 
> Yeah, but this is not what I meant here.
> 
> See a bit below for more description on what our hardware often does.

This hardware isn't wired to "GO" on a READ following a WRITE, as
you alluded to in your previous email in this thread.

> 
>>> That's done because writes should be accepted by the hardware block
>>> immediately while reads have a timeout associated with it.
>> That's is true but irrelevant here.
> 
> No it isn't. See the read of the register acts as a delay as well.
> 
> The read doesn't comes back to the CPU immediately when there is an 
> ongoing action to wait for.

There is no interlock in this hardware to wait for the status of
the "ongoing action to wait for", as you put it. The READ result
comes back right away, so long there is *any* status to return,
"not done", "no status", "retry", "fail", "done", etc.

[snip!]
> 
>> As the code is present right now in the kernel driver,
>> if the hardware changes state on the last delay of 10 ms,
>> you're going to miss that as no checks are done after the delay.
>> The code clearly shows this.
>>
>> All I'm saying is that to avoid this "race" condition,
>> the code should be modified, as there is no point in delaying 10
>> ms only to quit after that, when count gets to equal MAX_COUNT.
> 
> Well, that is indeed a good point!

That was also in my original patch review.

> 
>>>> The current code does flush the write buffer by reading back,
>>>> and then delaying (both in the loop), which does achieve this,
>>>> but it leaves a corner case as I wrote in my review.
>>>> The corner case is that if the status
>>>> register changes to "done" while in the last "delay()"
>>>> we then check the loop invariant and exit, as opposed to reading
>>>> the register and determining that it is done successfully.
>>>>
>>>> Current, all in pseudo-code:
>>>>
>>>> write()
>>>>
>>>> for (count = 0; count < MAX_COUNT; count++) {
>>>> 	res = read()
>>>> 	if (res is success) break
>>>> 	mdelay(10)              <-- if it changes here, we miss it on the last iteration
>>>> }
>>>>
>>>> Optimal:
>>>>
>>>> write()   ; assume write buffer flush
>>>> mdelay(9)
>>>> do {
>>>> 	mdelay(1)
>>>> 	res = read()
>>>> } while (res != success && ++count < MAX_COUNT)
>>>>
>>>> This solves the corner case, and ensures a time delay of 10 for
>>>> the hardware to process its job, but a delay of 1 for polling
>>>> the status, as it could be done "anytime now."
>>> That seems to assume that the ACK bits are not immediately cleared on
>>> the write. How do you got to that conclusion?
>> No such assumption was taken.
>> This is about the wasted 10 ms delay
>> as pointed to by the current code quoted above; the comment
>> on the line with an arrow in it.
> 
> Yeah, now I understand what you are trying to do here. Problem is that 
> this messes up the read delay feature of the hardware with that suggestion.

I doubt the existence of any "read delay feature",
as that would stall/sputter the CPU.

The WRITE-READ pair, forces the dependency to be resolved and
flushes the write buffer if one is used for that memory.

As the CPU and the bus are very fast, and the action to be
performed by the hardware may take ms (as indicated by the timeout
and the loop), the 1st READ result comes back with "not done",
almost guaranteed, unless there was no action to be performed.

> 
> In other words we should not have is a delay between the initial write 
> and the first read since the hardware will take care of this for us.

Sure, we can have a WRITE-READ pair, *to flush the write buffer*.

Regards,
Luben

> 
> Regards,
> Christian.
> 
>>
>>> Processing the request can take even longer than 10ms depending on what
>>> is done, which reference clock is selected, the temperature of the
>>> hardware, sleep mode etc etc...
>> That's why the loop.
>>
>> [snip!]
>>>> Assuming that WREG32_P() flushes the write buffer, as it seems
>>>> that it does, the idea here is to give the hardware some time to process
>>>> the request (from writing a value to it), but when polling to poll
>>>> a shorter amount of time.
>>> Not a bad idea, but this is based on the diagnostic code and already
>>> used for nearly a decade.
>> Yes, I understand that it does work right now. Surely,
>> if you set the loop limit high enough, you can show
>> that you'd iterate at least one over the number required
>> to complete or fail, and thus "used for nearly a decade".
>>
>> But from a computational point of view it is not correct
>> in that the very last delay is a wasted delay:
>>
>> 	write()
>> 	for (count = 0; count < MAX_COUNT; count++) {
>> 		res = read()
>> 		if (res is success) break
>> 		mdelay(10)              <-- if it changes here, we miss it on the last iteration
>> 	}
>>
>> It's not necessary to fix this, but it is good to point out
>> that this is currently the case.
>>
>>> In other words this code is what the hardware engineers recommended, is
>>> used *much* more often outside the driver and known to be working well.
>> Ah, yes, I didn't want to resort to the "hardware engineers recommended",
>> but here is what I've been told by hardware engineers (as seen by their
>> interface):
>>
>> 1. Write your request to the hardware.
>> 2. Wait M ms for the hardware to do its thing.
>> 3. Read back and check status.
>> 4. If done, then you're good.
>> 5. If not done, wait a tiny bit of time, N, and
>>     then check again, until done or timeout.
>>
>> Usually N << M, since M is the actual time for
>> hardware do do something, and N is a lot smaller
>> than M, since it is just back-off time as
>> the hardware can be done "anytime now".
>>
>> As a waveform it would look like this:
>>
>> Immediate success: WRITE, 10, READ.
>> Delayed success:   WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.
>> Delayed failure:   WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.
>>
>> The do-while loop above accomplishes this. The loop always ends on a read.
>> The current code would end on a delay on a failure, like this:
>> Delayed failure (count == MAX_COUNT): WRITE, READ, 10, READ, 10, READ, 10.
>>
>> But, as it is working right now "for a decade",
>> it can be left as is, as long as we realize that
>> the "last" delay is not necessary, when we're failing.
>>
>>>> I would've also liked to see the mutex business fixed
>>>> as well from my original patch review, as it is easy to prove
>>>> that it is always taken, so no need to embed it inside the if().
>>> That is indeed a software problem which should be fixed.
>> I've not seen a patch fixing this yet.
>>
>> Regards,
>> Luben
>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>> Luben
>>>>
>>>>> Christian.
>>>>>
>>>>>>     	/* Wait for CTLACK and CTLACK2 to get asserted */
>>>>>>     	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
>>>>>>     		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-07-06 13:53 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-26 16:12 [PATCH] drm/amdgpu: Adding wait time before reading upll control register Alex Jivin
2020-06-26 17:04 ` Christian König
2020-06-29 22:46   ` Luben Tuikov
2020-06-30  7:01     ` Christian König
2020-07-06 12:42       ` Luben Tuikov
2020-07-06 12:59         ` Christian König
2020-07-06 13:53           ` Luben Tuikov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.