* [PATCH] drm/amdgpu: Adding wait time before reading upll control register
@ 2020-06-26 16:12 Alex Jivin
  2020-06-26 17:04 ` Christian König
  0 siblings, 1 reply; 7+ messages in thread

From: Alex Jivin @ 2020-06-26 16:12 UTC (permalink / raw)
To: amd-gfx; +Cc: Alex Deucher, Luben Tuikov

Add a delay between writing to the UVD UPLL control register and reading
from it, to allow the HW time to process the write command.

Signed-off-by: Alex Jivin <alex.jivin@amd.com>
Suggested-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/si.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
index 9d7b4ccd17b8..42cdc14fb79d 100644
--- a/drivers/gpu/drm/amd/amdgpu/si.c
+++ b/drivers/gpu/drm/amd/amdgpu/si.c
@@ -1435,6 +1435,12 @@ static int si_uvd_send_upll_ctlreq(struct amdgpu_device *adev,
 	/* Assert UPLL_CTLREQ */
 	WREG32_P(cg_upll_func_cntl, UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK);
 
+	/* Wait for the HW to process the previous write,
+	 * to give it a chance to act before the first
+	 * read is done.
+	 */
+	mdelay(1);
+
 	/* Wait for CTLACK and CTLACK2 to get asserted */
 	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
 		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register
  2020-06-26 16:12 [PATCH] drm/amdgpu: Adding wait time before reading upll control register Alex Jivin
@ 2020-06-26 17:04 ` Christian König
  2020-06-29 22:46   ` Luben Tuikov
  0 siblings, 1 reply; 7+ messages in thread

From: Christian König @ 2020-06-26 17:04 UTC (permalink / raw)
To: Alex Jivin, amd-gfx; +Cc: Alex Deucher, Luben Tuikov

Am 26.06.20 um 18:12 schrieb Alex Jivin:
> Adding a delay between writing to UVD control register and reading from it.
> This is to allow the HW to process the write command.
>
> Signed-off-by: Alex Jivin <alex.jivin@amd.com>
> Suggested-By: Luben Tukov <luben.tuikov@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/si.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
> index 9d7b4ccd17b8..42cdc14fb79d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/si.c
> +++ b/drivers/gpu/drm/amd/amdgpu/si.c
> @@ -1435,6 +1435,12 @@ static int si_uvd_send_upll_ctlreq(struct amdgpu_device *adev,
>  	/* Assert UPLL_CTLREQ */
>  	WREG32_P(cg_upll_func_cntl, UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK);
>  
> +	/* Waiting for HW to process the previous write.
> +	 * This is to give a chance to HW to act before
> +	 * the first read is done.
> +	 */
> +	mdelay(1);
> +

Mhm, that is most likely not a good idea.

We need to issue a read after the write to make sure that the stuff is
sent out to the hardware. Adding a delay here is probably just
postponing that.

Do we have some note in the documentation that this is necessary?

Christian.

> 	/* Wait for CTLACK and CTLACK2 to get asserted */
> 	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
> 		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;
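[Editor's note: Christian's point about issuing a read after the write is the classic "posting read" pattern. A minimal userspace sketch of the idea follows; the `volatile` pointer is a hypothetical stand-in for an `ioremap()`ed MMIO register, since on PCI the writes are posted (buffered) and a subsequent read from the same device cannot complete until the preceding write has actually reached the hardware.]

```c
#include <stdint.h>
#include <assert.h>

/* Sketch only, not driver code: write an MMIO register, then issue a
 * throwaway read-back so the posted write is flushed out to the device
 * before execution continues. */
static inline void write_and_flush(volatile uint32_t *reg, uint32_t val)
{
	*reg = val;	/* posted write */
	(void)*reg;	/* read-back: forces the write out of the buffer */
}
```

In the kernel proper this is the familiar `writel()` followed by a throwaway `readl()` on the same device.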
* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register 2020-06-26 17:04 ` Christian König @ 2020-06-29 22:46 ` Luben Tuikov 2020-06-30 7:01 ` Christian König 0 siblings, 1 reply; 7+ messages in thread From: Luben Tuikov @ 2020-06-29 22:46 UTC (permalink / raw) To: christian.koenig, Alex Jivin, amd-gfx; +Cc: Alex Deucher On 2020-06-26 1:04 p.m., Christian König wrote: > Am 26.06.20 um 18:12 schrieb Alex Jivin: >> Adding a delay between writing to UVD control register and reading from it. >> This is to allow the HW to process the write command. >> >> Signed-off-by: Alex Jivin <alex.jivin@amd.com> >> Suggested-By: Luben Tukov <luben.tuikov@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/si.c | 6 ++++++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c >> index 9d7b4ccd17b8..42cdc14fb79d 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/si.c >> +++ b/drivers/gpu/drm/amd/amdgpu/si.c >> @@ -1435,6 +1435,12 @@ static int si_uvd_send_upll_ctlreq(struct amdgpu_device *adev, >> /* Assert UPLL_CTLREQ */ >> WREG32_P(cg_upll_func_cntl, UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK); >> >> + /* Waiting for HW to process the previous write. >> + * This is to give a chance to HW to act before >> + * the first read is done. >> + */ >> + mdelay(1); >> + > > Mhm, that is most likely not a good idea. > > We need to issue a read after the write to make sure that the stuff is > send out to the hardware. Tracing down WREG32_P(), it seems to be writing and then reading the register, back to back, twice over, when deferring to PCIe space, and just writel() when in mmio space. (There is separate thread on this as it doesn't seem correct.) > Adding a delay here is probably just postponing that. Do we have some > note in the documentation that this is necessary? Flushing the write buffer (by issuing a read if necessary) is different than waiting for the hardware to process the request. 
The current code does flush the write buffer by reading back, and then
delaying (both in the loop), which does achieve this, but it leaves a
corner case as I wrote in my review. The corner case is that if the
status register changes to "done" while in the last "delay()", we then
check the loop invariant and exit, as opposed to reading the register
and determining that it is done successfully.

Current, all in pseudo-code:

	write()

	for (count = 0; count < MAX_COUNT; count++) {
		res = read()
		if (res is success)
			break
		mdelay(10)  <-- if it changes here, we miss it on the last iteration
	}

Optimal:

	write()     ; assume write buffer flush
	mdelay(9)
	do {
		mdelay(1)
		res = read()
	} while (res != success && ++count < MAX_COUNT)

This solves the corner case, and ensures a time delay of 10 for the
hardware to process its job, but a delay of 1 for polling the status,
as it could be done "anytime now."

Assuming that WREG32_P() flushes the write buffer, as it seems that it
does, the idea here is to give the hardware some time to process the
request (from writing a value to it), but when polling, to poll for a
shorter amount of time.

I would've also liked to see the mutex business fixed as well from my
original patch review, as it is easy to prove that it is always taken,
so there is no need to embed it inside the if().

Regards,
Luben

>
> Christian.
>
>> 	/* Wait for CTLACK and CTLACK2 to get asserted */
>> 	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
>> 		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;
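[Editor's note: the corner case Luben describes can be reproduced in a small userspace simulation. Everything here is a hypothetical model rather than driver code: `sim_mdelay()` merely advances a fake clock, and `sim_read_ack()` models the CTLACK bits flipping at a chosen time.]

```c
#include <assert.h>
#include <stdbool.h>

static int now;       /* simulated time, in ms */
static int ready_at;  /* tick at which the fake ACK bits flip */

static void sim_mdelay(int ms) { now += ms; }
static bool sim_read_ack(void) { return now >= ready_at; }

/* Current scheme: read first, delay last. The final mdelay(10) is
 * wasted, and a flip that happens during it is reported as a failure. */
static bool poll_current(int max_count)
{
	for (int i = 0; i < max_count; ++i) {
		if (sim_read_ack())
			return true;
		sim_mdelay(10);
	}
	return false;
}

/* "Optimal" scheme from the review: delay first, always end on a read. */
static bool poll_optimal(int max_count)
{
	int count = 0;

	sim_mdelay(9);
	do {
		sim_mdelay(1);
		if (sim_read_ack())
			return true;
	} while (++count < max_count);
	return false;
}
```

With the status flipping at t = 15 ms, `poll_current(2)` spins until t = 20 ms yet still reports failure, while `poll_optimal()` catches the flip because its loop ends on a read.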
* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register 2020-06-29 22:46 ` Luben Tuikov @ 2020-06-30 7:01 ` Christian König 2020-07-06 12:42 ` Luben Tuikov 0 siblings, 1 reply; 7+ messages in thread From: Christian König @ 2020-06-30 7:01 UTC (permalink / raw) To: Luben Tuikov, Alex Jivin, amd-gfx; +Cc: Alex Deucher Am 30.06.20 um 00:46 schrieb Luben Tuikov: > On 2020-06-26 1:04 p.m., Christian König wrote: >> Am 26.06.20 um 18:12 schrieb Alex Jivin: >>> Adding a delay between writing to UVD control register and reading from it. >>> This is to allow the HW to process the write command. >>> >>> Signed-off-by: Alex Jivin <alex.jivin@amd.com> >>> Suggested-By: Luben Tukov <luben.tuikov@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/si.c | 6 ++++++ >>> 1 file changed, 6 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c >>> index 9d7b4ccd17b8..42cdc14fb79d 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/si.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/si.c >>> @@ -1435,6 +1435,12 @@ static int si_uvd_send_upll_ctlreq(struct amdgpu_device *adev, >>> /* Assert UPLL_CTLREQ */ >>> WREG32_P(cg_upll_func_cntl, UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK); >>> >>> + /* Waiting for HW to process the previous write. >>> + * This is to give a chance to HW to act before >>> + * the first read is done. >>> + */ >>> + mdelay(1); >>> + >> Mhm, that is most likely not a good idea. >> >> We need to issue a read after the write to make sure that the stuff is >> send out to the hardware. > Tracing down WREG32_P(), it seems to be writing and then reading the register, Why do you think so? What the macro should do is to read, apply mask and value and then write it back to the register: #define WREG32_P(reg, val, mask) \ do { \ uint32_t tmp_ = RREG32(reg); \ tmp_ &= (mask); \ tmp_ |= ((val) & ~(mask)); \ WREG32(reg, tmp_); \ } while (0) > back to back, twice over, when deferring to PCIe space, and just writel() when in mmio > space. 
(There is separate thread on this as it doesn't seem correct.) That indeed sounds fishy, but all those registers should be in mmio space. > >> Adding a delay here is probably just postponing that. Do we have some >> note in the documentation that this is necessary? > Flushing the write buffer (by issuing a read if necessary) is different > than waiting for the hardware to process the request. It's not flushing the write buffer. The hardware is often build in a way that the next read is triggering the action instead of the write. That's done because writes should be accepted by the hardware block immediately while reads have a timeout associated with it. > The current code does flush the write buffer by reading back, > and then delaying (both in the loop), which does achieve this, > but it leaves a corner case as I wrote in my review. > The corner case is that if the status > register changes to "done" while in the last "delay()" > we then check the loop invariant and exit, as opposed to reading > the register and determining that it is done successfully. > > Current, all in pseudo-code: > > write() > > for (count = 0; count < MAX_COUNT; count++) { > res = read() > if (res is success) break > mdelay(10) <-- if it changes here, we miss it on the last iteration > } > > Optimal: > > write() ; assume write buffer flush > mdelay(9) > do { > mdelay(1) > res = read() > } while (res != success && ++count < MAX_COUNT) > > This solves the corner case, and ensures a time delay of 10 for > the hardware to process its job, but a delay of 1 for polling > the status, as it could be done "anytime now." That seems to assume that the ACK bits are not immediately cleared on the write. How do you got to that conclusion? Processing the request can take even longer than 10ms depending on what is done, which reference clock is selected, the temperature of the hardware, sleep mode etc etc... For example switching to bypass mode usually comes immediately, e.g. 
you write the CNTL register and you immediately get an ack back. But as far as I know switching from bypass back to normal means that we need to wait for both the reference, feedback and output signal to be logical low and then high again at the same time. Depending on the frequencies we got here that can take a bit of time. > Assuming that WREG32_P() flushes the write buffer, as it seems > that it does, the idea here is to give the hardware some time to process > the request (from writing a value to it), but when polling to poll > a shorter amount of time. Not a bad idea, but this is based on the diagnostic code and already used for nearly a decade. In other words this code is what the hardware engineers recommended, is used *much* more often outside the driver and known to be working well. > I would've also liked to see the mutex business fixed > as well from my original patch review, as it is easy to prove > that it is always taken, so no need to embed it inside the if(). That is indeed a software problem which should be fixed. Regards, Christian. > > Regards, > Luben > >> Christian. >> >>> /* Wait for CTLACK and CTLACK2 to get asserted */ >>> for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) { >>> uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK; _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 7+ messages in thread
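[Editor's note: the read-modify-write macro Christian quotes can be exercised standalone. Note the easy-to-misread convention: `mask` names the bits to *preserve*, and `val & ~mask` supplies the new bits — which is why the caller passes `UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK` to change just the CTLREQ bit. The `rreg`/`wreg` helpers below are hypothetical stand-ins for RREG32/WREG32 against a fake register.]

```c
#include <stdint.h>
#include <assert.h>

static uint32_t fake_reg;	/* stand-in for the MMIO register */

static uint32_t rreg(void)       { return fake_reg; }
static void     wreg(uint32_t v) { fake_reg = v; }

#define UPLL_CTLREQ_MASK 0x00000008u	/* illustrative bit position */

/* Same shape as the kernel's WREG32_P: keep the bits in `mask`,
 * take the remaining bits from `val`. */
#define WREG32_P(val, mask)			\
	do {					\
		uint32_t tmp_ = rreg();		\
		tmp_ &= (mask);			\
		tmp_ |= ((val) & ~(mask));	\
		wreg(tmp_);			\
	} while (0)
```

So `WREG32_P(UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK)` sets bit 3 while leaving every other bit untouched, and `WREG32_P(0, ~UPLL_CTLREQ_MASK)` clears it again.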
* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register 2020-06-30 7:01 ` Christian König @ 2020-07-06 12:42 ` Luben Tuikov 2020-07-06 12:59 ` Christian König 0 siblings, 1 reply; 7+ messages in thread From: Luben Tuikov @ 2020-07-06 12:42 UTC (permalink / raw) To: Christian König, Alex Jivin, amd-gfx; +Cc: Alex Deucher On 2020-06-30 3:01 a.m., Christian König wrote: > Am 30.06.20 um 00:46 schrieb Luben Tuikov: >> On 2020-06-26 1:04 p.m., Christian König wrote: >>> Am 26.06.20 um 18:12 schrieb Alex Jivin: >>>> Adding a delay between writing to UVD control register and reading from it. >>>> This is to allow the HW to process the write command. >>>> >>>> Signed-off-by: Alex Jivin <alex.jivin@amd.com> >>>> Suggested-By: Luben Tukov <luben.tuikov@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/si.c | 6 ++++++ >>>> 1 file changed, 6 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c >>>> index 9d7b4ccd17b8..42cdc14fb79d 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/si.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/si.c >>>> @@ -1435,6 +1435,12 @@ static int si_uvd_send_upll_ctlreq(struct amdgpu_device *adev, >>>> /* Assert UPLL_CTLREQ */ >>>> WREG32_P(cg_upll_func_cntl, UPLL_CTLREQ_MASK, ~UPLL_CTLREQ_MASK); >>>> >>>> + /* Waiting for HW to process the previous write. >>>> + * This is to give a chance to HW to act before >>>> + * the first read is done. >>>> + */ >>>> + mdelay(1); >>>> + >>> Mhm, that is most likely not a good idea. >>> >>> We need to issue a read after the write to make sure that the stuff is >>> send out to the hardware. ^^^ This is (A). >> Tracing down WREG32_P(), it seems to be writing and then reading the register, > > Why do you think so? soc15_pcie_wreg() shows it. But as I said, this may have been broken by this 2eee0229f65e897134566888e5321bcb3af0df7a patch and there's a separate thread on this. 
> > What the macro should do is to read, apply mask and value and then write > it back to the register: > > #define WREG32_P(reg, val, mask) \ > do { \ > uint32_t tmp_ = RREG32(reg); \ > tmp_ &= (mask); \ > tmp_ |= ((val) & ~(mask)); \ > WREG32(reg, tmp_); \ > } while (0) > >> back to back, twice over, when deferring to PCIe space, and just writel() when in mmio >> space. (There is separate thread on this as it doesn't seem correct.) > > That indeed sounds fishy, but all those registers should be in mmio space. > >> >>> Adding a delay here is probably just postponing that. Do we have some >>> note in the documentation that this is necessary? >> Flushing the write buffer (by issuing a read if necessary) is different >> than waiting for the hardware to process the request. > > It's not flushing the write buffer. The hardware is often build in a way > that the next read is triggering the action instead of the write. I don't think that the "read is triggering the action", in this case. The write is triggering the action, when it makes it to the hardware. Also, often enough, the coherency unit is built with dependency detection. This means that a read following a write to the same "hardware" address, would cause the read op to trigger the flushing of the write buffer if the address is still in the write buffer. And then the read would be issued, following the flush, usually by the bus unit (BU). (So the bus would show a bunch of WRITES followed by our READ.) You seem to agree as you said that in (A) above. > That's done because writes should be accepted by the hardware block > immediately while reads have a timeout associated with it. That's is true but irrelevant here. As the code is present right now in the kernel driver, if the hardware changes state on the last delay of 10 ms, you're going to miss that as no checks are done after the delay. The code clearly shows this. 
All I'm saying is that to avoid this "race" condition, the code should be modified, as there is no point in delaying 10 ms only to quit after that, when count gets to equal MAX_COUNT. >> The current code does flush the write buffer by reading back, >> and then delaying (both in the loop), which does achieve this, >> but it leaves a corner case as I wrote in my review. >> The corner case is that if the status >> register changes to "done" while in the last "delay()" >> we then check the loop invariant and exit, as opposed to reading >> the register and determining that it is done successfully. >> >> Current, all in pseudo-code: >> >> write() >> >> for (count = 0; count < MAX_COUNT; count++) { >> res = read() >> if (res is success) break >> mdelay(10) <-- if it changes here, we miss it on the last iteration >> } >> >> Optimal: >> >> write() ; assume write buffer flush >> mdelay(9) >> do { >> mdelay(1) >> res = read() >> } while (res != success && ++count < MAX_COUNT) >> >> This solves the corner case, and ensures a time delay of 10 for >> the hardware to process its job, but a delay of 1 for polling >> the status, as it could be done "anytime now." > > That seems to assume that the ACK bits are not immediately cleared on > the write. How do you got to that conclusion? No such assumption was taken. This is about the wasted 10 ms delay as pointed to by the current code quoted above; the comment on the line with an arrow in it. > > Processing the request can take even longer than 10ms depending on what > is done, which reference clock is selected, the temperature of the > hardware, sleep mode etc etc... That's why the loop. > [snip!] > >> Assuming that WREG32_P() flushes the write buffer, as it seems >> that it does, the idea here is to give the hardware some time to process >> the request (from writing a value to it), but when polling to poll >> a shorter amount of time. 
> > Not a bad idea, but this is based on the diagnostic code and already > used for nearly a decade. Yes, I understand that it does work right now. Surely, if you set the loop limit high enough, you can show that you'd iterate at least one over the number required to complete or fail, and thus "used for nearly a decade". But from a computational point of view it is not correct in that the very last delay is a wasted delay: write() for (count = 0; count < MAX_COUNT; count++) { res = read() if (res is success) break mdelay(10) <-- if it changes here, we miss it on the last iteration } It's not necessary to fix this, but it is good to point out that this is currently the case. > In other words this code is what the hardware engineers recommended, is > used *much* more often outside the driver and known to be working well. Ah, yes, I didn't want to resort to the "hardware engineers recommended", but here is what I've been told by hardware engineers (as seen by their interface): 1. Write your request to the hardware. 2. Wait M ms for the hardware to do its thing. 3. Read back and check status. 4. If done, then you're good. 5. If not done, wait a tiny bit of time, N, and then check again, until done or timeout. Usually N << M, since M is the actual time for hardware do do something, and N is a lot smaller than M, since it is just back-off time as the hardware can be done "anytime now". As a waveform it would look like this: Immediate success: WRITE, 10, READ. Delayed success: WRITE, 10, READ, 1, READ 1, READ 1, ..., READ. Delayed failure: WRITE, 10, READ, 1, READ 1, READ 1, ..., READ. The do-while loop above accomplishes this. The loop always ends on a read. The current code would end on a delay on a failure, like this: Delayed failure (count == MAX_COUNT): WRITE, READ, 10, READ, 10, READ, 10. But, as it is working right now "for a decade", it can be left as is, as long as we realize that the "last" delay is not necessary, when we're failing. 
> >> I would've also liked to see the mutex business fixed >> as well from my original patch review, as it is easy to prove >> that it is always taken, so no need to embed it inside the if(). > > That is indeed a software problem which should be fixed. I've not seen a patch fixing this yet. Regards, Luben > > Regards, > Christian. > >> >> Regards, >> Luben >> >>> Christian. >>> >>>> /* Wait for CTLACK and CTLACK2 to get asserted */ >>>> for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) { >>>> uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK; > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 7+ messages in thread
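[Editor's note: the five steps Luben attributes to the hardware engineers (write, wait M once, then poll every N until done or timeout, with N << M) can be sketched as a generic helper. `hw_write`/`hw_ready` are hypothetical callbacks against a simulated clock that only advances inside the delays.]

```c
#include <stdbool.h>
#include <assert.h>

static int sim_now;		/* simulated time, in ms */
static int sim_ready_at;	/* when the fake hardware finishes */

static void sim_mdelay(int ms) { sim_now += ms; }
static void hw_write(void)     { /* kick off the fake operation */ }
static bool hw_ready(void)     { return sim_now >= sim_ready_at; }

/* Write, wait M ms once for the hardware to do its thing, then poll
 * every N ms until done or timeout. The sequence always ends on a
 * read, so a status flip during the final delay is never lost. */
static bool write_then_poll(int m_ms, int n_ms, int max_polls)
{
	int count = 0;

	hw_write();
	sim_mdelay(m_ms);
	do {
		if (hw_ready())
			return true;
		sim_mdelay(n_ms);
	} while (++count < max_polls);
	return hw_ready();	/* final read after the last delay */
}
```

On failure this yields the waveform WRITE, M, READ, N, READ, ..., N, READ — the mirror image of the current code's WRITE, READ, 10, ..., READ, 10, which ends on a wasted delay.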
* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register 2020-07-06 12:42 ` Luben Tuikov @ 2020-07-06 12:59 ` Christian König 2020-07-06 13:53 ` Luben Tuikov 0 siblings, 1 reply; 7+ messages in thread From: Christian König @ 2020-07-06 12:59 UTC (permalink / raw) To: Luben Tuikov, Alex Jivin, amd-gfx; +Cc: Alex Deucher Am 06.07.20 um 14:42 schrieb Luben Tuikov: > On 2020-06-30 3:01 a.m., Christian König wrote: >> Am 30.06.20 um 00:46 schrieb Luben Tuikov: >>> On 2020-06-26 1:04 p.m., Christian König wrote: >>>> Am 26.06.20 um 18:12 schrieb Alex Jivin: >>>> [SNIP] >>>> Adding a delay here is probably just postponing that. Do we have some >>>> note in the documentation that this is necessary? >>> Flushing the write buffer (by issuing a read if necessary) is different >>> than waiting for the hardware to process the request. >> It's not flushing the write buffer. The hardware is often build in a way >> that the next read is triggering the action instead of the write. > I don't think that the "read is triggering the action", in this case. > The write is triggering the action, when it makes it to the hardware. > > Also, often enough, the coherency unit is built with dependency detection. > This means that a read following a write to the same "hardware" address, > would cause the read op to trigger the flushing of the write buffer > if the address is still in the write buffer. And then the read > would be issued, following the flush, usually by the bus unit (BU). > (So the bus would show a bunch of WRITES followed by our READ.) > You seem to agree as you said that in (A) above. Yeah, but this is not what I meant here. See a bit below for more description on what our hardware often does. >> That's done because writes should be accepted by the hardware block >> immediately while reads have a timeout associated with it. > That's is true but irrelevant here. No it isn't. See the read of the register acts as a delay as well. 
The read doesn't comes back to the CPU immediately when there is an ongoing action to wait for. Some of the SDMA, TC and HDP registers even go as far as delaying the return of the read until some memory is flushed. The maximum timeout here is rather large, something in the range of 65ms IIRC. > As the code is present right now in the kernel driver, > if the hardware changes state on the last delay of 10 ms, > you're going to miss that as no checks are done after the delay. > The code clearly shows this. > > All I'm saying is that to avoid this "race" condition, > the code should be modified, as there is no point in delaying 10 > ms only to quit after that, when count gets to equal MAX_COUNT. Well, that is indeed a good point! >>> The current code does flush the write buffer by reading back, >>> and then delaying (both in the loop), which does achieve this, >>> but it leaves a corner case as I wrote in my review. >>> The corner case is that if the status >>> register changes to "done" while in the last "delay()" >>> we then check the loop invariant and exit, as opposed to reading >>> the register and determining that it is done successfully. >>> >>> Current, all in pseudo-code: >>> >>> write() >>> >>> for (count = 0; count < MAX_COUNT; count++) { >>> res = read() >>> if (res is success) break >>> mdelay(10) <-- if it changes here, we miss it on the last iteration >>> } >>> >>> Optimal: >>> >>> write() ; assume write buffer flush >>> mdelay(9) >>> do { >>> mdelay(1) >>> res = read() >>> } while (res != success && ++count < MAX_COUNT) >>> >>> This solves the corner case, and ensures a time delay of 10 for >>> the hardware to process its job, but a delay of 1 for polling >>> the status, as it could be done "anytime now." >> That seems to assume that the ACK bits are not immediately cleared on >> the write. How do you got to that conclusion? > No such assumption was taken. 
> This is about the wasted 10 ms delay > as pointed to by the current code quoted above; the comment > on the line with an arrow in it. Yeah, now I understand what you are trying to do here. Problem is that this messes up the read delay feature of the hardware with that suggestion. In other words we should not have is a delay between the initial write and the first read since the hardware will take care of this for us. Regards, Christian. > >> Processing the request can take even longer than 10ms depending on what >> is done, which reference clock is selected, the temperature of the >> hardware, sleep mode etc etc... > That's why the loop. > > [snip!] >>> Assuming that WREG32_P() flushes the write buffer, as it seems >>> that it does, the idea here is to give the hardware some time to process >>> the request (from writing a value to it), but when polling to poll >>> a shorter amount of time. >> Not a bad idea, but this is based on the diagnostic code and already >> used for nearly a decade. > Yes, I understand that it does work right now. Surely, > if you set the loop limit high enough, you can show > that you'd iterate at least one over the number required > to complete or fail, and thus "used for nearly a decade". > > But from a computational point of view it is not correct > in that the very last delay is a wasted delay: > > write() > for (count = 0; count < MAX_COUNT; count++) { > res = read() > if (res is success) break > mdelay(10) <-- if it changes here, we miss it on the last iteration > } > > It's not necessary to fix this, but it is good to point out > that this is currently the case. > >> In other words this code is what the hardware engineers recommended, is >> used *much* more often outside the driver and known to be working well. > Ah, yes, I didn't want to resort to the "hardware engineers recommended", > but here is what I've been told by hardware engineers (as seen by their > interface): > > 1. Write your request to the hardware. > 2. 
Wait M ms for the hardware to do its thing. > 3. Read back and check status. > 4. If done, then you're good. > 5. If not done, wait a tiny bit of time, N, and > then check again, until done or timeout. > > Usually N << M, since M is the actual time for > hardware do do something, and N is a lot smaller > than M, since it is just back-off time as > the hardware can be done "anytime now". > > As a waveform it would look like this: > > Immediate success: WRITE, 10, READ. > Delayed success: WRITE, 10, READ, 1, READ 1, READ 1, ..., READ. > Delayed failure: WRITE, 10, READ, 1, READ 1, READ 1, ..., READ. > > The do-while loop above accomplishes this. The loop always ends on a read. > The current code would end on a delay on a failure, like this: > Delayed failure (count == MAX_COUNT): WRITE, READ, 10, READ, 10, READ, 10. > > But, as it is working right now "for a decade", > it can be left as is, as long as we realize that > the "last" delay is not necessary, when we're failing. > >>> I would've also liked to see the mutex business fixed >>> as well from my original patch review, as it is easy to prove >>> that it is always taken, so no need to embed it inside the if(). >> That is indeed a software problem which should be fixed. > I've not seen a patch fixing this yet. > > Regards, > Luben > >> Regards, >> Christian. >> >>> Regards, >>> Luben >>> >>>> Christian. >>>> >>>>> /* Wait for CTLACK and CTLACK2 to get asserted */ >>>>> for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) { >>>>> uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK; _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] drm/amdgpu: Adding wait time before reading upll control register 2020-07-06 12:59 ` Christian König @ 2020-07-06 13:53 ` Luben Tuikov 0 siblings, 0 replies; 7+ messages in thread From: Luben Tuikov @ 2020-07-06 13:53 UTC (permalink / raw) To: Christian König, Alex Jivin, amd-gfx; +Cc: Alex Deucher On 2020-07-06 8:59 a.m., Christian König wrote: > Am 06.07.20 um 14:42 schrieb Luben Tuikov: >> On 2020-06-30 3:01 a.m., Christian König wrote: >>> Am 30.06.20 um 00:46 schrieb Luben Tuikov: >>>> On 2020-06-26 1:04 p.m., Christian König wrote: >>>>> Am 26.06.20 um 18:12 schrieb Alex Jivin: >>>>> [SNIP] >>>>> Adding a delay here is probably just postponing that. Do we have some >>>>> note in the documentation that this is necessary? >>>> Flushing the write buffer (by issuing a read if necessary) is different >>>> than waiting for the hardware to process the request. >>> It's not flushing the write buffer. The hardware is often build in a way >>> that the next read is triggering the action instead of the write. >> I don't think that the "read is triggering the action", in this case. >> The write is triggering the action, when it makes it to the hardware. >> >> Also, often enough, the coherency unit is built with dependency detection. >> This means that a read following a write to the same "hardware" address, >> would cause the read op to trigger the flushing of the write buffer >> if the address is still in the write buffer. And then the read >> would be issued, following the flush, usually by the bus unit (BU). >> (So the bus would show a bunch of WRITES followed by our READ.) >> You seem to agree as you said that in (A) above. > > Yeah, but this is not what I meant here. > > See a bit below for more description on what our hardware often does. This hardware isn't wired to "GO" on a READ following a WRITE, as you alluded to in your previous email in this thread. 
> >>> That's done because writes should be accepted by the hardware block >>> immediately while reads have a timeout associated with it. >> That's is true but irrelevant here. > > No it isn't. See the read of the register acts as a delay as well. > > The read doesn't comes back to the CPU immediately when there is an > ongoing action to wait for. There is no interlock in this hardware to wait for the status of the "ongoing action to wait for", as you put it. The READ result comes back right away, so long there is *any* status to return, "not done", "no status", "retry", "fail", "done", etc. [snip!] > >> As the code is present right now in the kernel driver, >> if the hardware changes state on the last delay of 10 ms, >> you're going to miss that as no checks are done after the delay. >> The code clearly shows this. >> >> All I'm saying is that to avoid this "race" condition, >> the code should be modified, as there is no point in delaying 10 >> ms only to quit after that, when count gets to equal MAX_COUNT. > > Well, that is indeed a good point! That was also in my original patch review. > >>>> The current code does flush the write buffer by reading back, >>>> and then delaying (both in the loop), which does achieve this, >>>> but it leaves a corner case as I wrote in my review. >>>> The corner case is that if the status >>>> register changes to "done" while in the last "delay()" >>>> we then check the loop invariant and exit, as opposed to reading >>>> the register and determining that it is done successfully. 
>>>>
>>>> Currently, all in pseudo-code:
>>>>
>>>> write()
>>>>
>>>> for (count = 0; count < MAX_COUNT; count++) {
>>>> 	res = read()
>>>> 	if (res is success)
>>>> 		break
>>>> 	mdelay(10)  <-- if it changes here, we miss it on the last iteration
>>>> }
>>>>
>>>> Optimal:
>>>>
>>>> write()  ; assume write buffer flush
>>>> mdelay(9)
>>>> do {
>>>> 	mdelay(1)
>>>> 	res = read()
>>>> } while (res != success && ++count < MAX_COUNT)
>>>>
>>>> This solves the corner case, and ensures a time delay of 10 for
>>>> the hardware to process its job, but a delay of 1 for polling
>>>> the status, as it could be done "anytime now."
>>> That seems to assume that the ACK bits are not immediately cleared on
>>> the write. How did you come to that conclusion?
>> No such assumption was made.
>> This is about the wasted 10 ms delay,
>> as pointed to by the current code quoted above; see the comment
>> on the line with an arrow in it.
>
> Yeah, now I understand what you are trying to do here. Problem is that
> this messes up the read-delay feature of the hardware with that suggestion.

I doubt the existence of any "read-delay feature", as that would
stall/sputter the CPU. The WRITE-READ pair forces the dependency to be
resolved and flushes the write buffer, if one is used for that memory.
As the CPU and the bus are very fast, and the action to be performed by
the hardware may take milliseconds (as indicated by the timeout and the
loop), the 1st READ result comes back with "not done", almost
guaranteed, unless there was no action to be performed.

>
> In other words, we should not have a delay between the initial write
> and the first read, since the hardware will take care of this for us.

Sure, we can have a WRITE-READ pair, *to flush the write buffer*.

Regards,
Luben

>
> Regards,
> Christian.
>
>>
>>> Processing the request can take even longer than 10ms depending on what
>>> is done, which reference clock is selected, the temperature of the
>>> hardware, sleep mode, etc. etc...
>> That's why the loop.
>>
>> [snip!]
>>>> Assuming that WREG32_P() flushes the write buffer, as it seems
>>>> that it does, the idea here is to give the hardware some time to
>>>> process the request (from writing a value to it), but when polling,
>>>> to poll for a shorter amount of time.
>>> Not a bad idea, but this is based on the diagnostic code and has
>>> already been used for nearly a decade.
>> Yes, I understand that it does work right now. Surely,
>> if you set the loop limit high enough, you can show
>> that you'd iterate at least once over the number required
>> to complete or fail, and thus "used for nearly a decade".
>>
>> But from a computational point of view it is not correct,
>> in that the very last delay is a wasted delay:
>>
>> write()
>> for (count = 0; count < MAX_COUNT; count++) {
>> 	res = read()
>> 	if (res is success)
>> 		break
>> 	mdelay(10)  <-- if it changes here, we miss it on the last iteration
>> }
>>
>> It's not necessary to fix this, but it is good to point out
>> that this is currently the case.
>>
>>> In other words, this code is what the hardware engineers recommended,
>>> is used *much* more often outside the driver, and is known to be
>>> working well.
>> Ah, yes, I didn't want to resort to "the hardware engineers
>> recommended", but here is what I've been told by hardware engineers
>> (as seen by their interface):
>>
>> 1. Write your request to the hardware.
>> 2. Wait M ms for the hardware to do its thing.
>> 3. Read back and check status.
>> 4. If done, then you're good.
>> 5. If not done, wait a tiny bit of time, N, and
>>    then check again, until done or timeout.
>>
>> Usually N << M, since M is the actual time for
>> the hardware to do something, and N is a lot smaller
>> than M, since it is just back-off time, as
>> the hardware can be done "anytime now".
>>
>> As a waveform it would look like this:
>>
>> Immediate success: WRITE, 10, READ.
>> Delayed success:   WRITE, 10, READ, 1, READ, 1, READ, 1, ..., READ.
>> Delayed failure:   WRITE, 10, READ, 1, READ, 1, READ, 1, ..., READ.
>>
>> The do-while loop above accomplishes this. The loop always ends on a
>> read. The current code would end on a delay on a failure, like this:
>>
>> Delayed failure (count == MAX_COUNT): WRITE, READ, 10, READ, 10, READ, 10.
>>
>> But, as it has been working "for a decade",
>> it can be left as is, as long as we realize that
>> the "last" delay is not necessary when we're failing.
>>
>>>> I would've also liked to see the mutex business fixed
>>>> as well from my original patch review, as it is easy to prove
>>>> that it is always taken, so no need to embed it inside the if().
>>> That is indeed a software problem which should be fixed.
>> I've not seen a patch fixing this yet.
>>
>> Regards,
>> Luben
>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>> Luben
>>>>
>>>>> Christian.
>>>>>
>>>>>> 	/* Wait for CTLACK and CTLACK2 to get asserted */
>>>>>> 	for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
>>>>>> 		uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;
end of thread, other threads:[~2020-07-06 13:53 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-26 16:12 [PATCH] drm/amdgpu: Adding wait time before reading upll control register Alex Jivin
2020-06-26 17:04 ` Christian König
2020-06-29 22:46   ` Luben Tuikov
2020-06-30  7:01     ` Christian König
2020-07-06 12:42       ` Luben Tuikov
2020-07-06 12:59         ` Christian König
2020-07-06 13:53           ` Luben Tuikov