From: Janusz Krzysztofik
To: Karolina Drobnik
Cc: igt-dev@lists.freedesktop.org, Chris Wilson, Tvrtko Ursulin
Subject: Re: [igt-dev] [PATCH i-g-t v2 1/2] tests/gem_exec_fence: Check
 stored values only for valid workloads
Date: Wed, 03 Aug 2022 09:21:03 +0200
Message-ID: <8978773.CDJkKcVGEf@jkrzyszt-mobl1.ger.corp.intel.com>
In-Reply-To: <8384c9cb-e870-6a3a-15cb-c223bf6243eb@intel.com>
References: <2468862.4XsnlVU6TS@jkrzyszt-mobl1.ger.corp.intel.com>
 <8384c9cb-e870-6a3a-15cb-c223bf6243eb@intel.com>

On Tuesday, 2 August 2022 12:20:29 CEST Karolina Drobnik wrote:
> Hi,
> 
> On 01.08.2022 16:43, Janusz Krzysztofik wrote:
> > On Monday, 1 August 2022 15:39:36 CEST Karolina Drobnik wrote:
> >> Hi,
> >> 
> >> On 01.08.2022 13:54, Janusz Krzysztofik wrote:
> >>> On Monday, 1 August 2022 11:51:27 CEST Karolina Drobnik wrote:
> >>>> Hi,
> >>>> 
> >>>> On 01.08.2022 10:11, Janusz Krzysztofik wrote:
> >>>>> On Monday, 1 August 2022 07:53:54 CEST Karolina Drobnik wrote:
> >>>>>> Hi,
> >>>>>> 
> >>>>>> On 29.07.2022 17:23, Janusz Krzysztofik wrote:
> >>>>>>> On Friday, 29 July 2022 13:32:37 CEST Karolina Drobnik wrote:
> >>>>>>>> Hi Janusz,
> >>>>>>>> 
> >>>>>>>> On 29.07.2022 10:24, Janusz Krzysztofik wrote:
> >>>>>>>>> Hi Karolina,
> >>>>>>>>> 
> >>>>>>>>> On Friday, 29 July 2022 09:38:43 CEST Karolina Drobnik wrote:
> >>>>>>>>>> Hi Janusz,
> >>>>>>>>>> 
> >>>>>>>>>> Thanks a lot for taking a look at the patch.
> >>>>>>>>>> 
> >>>>>>>>>> On 28.07.2022 18:56, Janusz Krzysztofik wrote:
> >>>>>>>>>>> On Tuesday, 26 July 2022 12:13:11 CEST Karolina Drobnik wrote:
> >>>>>>>>>>>> From: Chris Wilson
> >>>>>>>>>>>> 
> >>>>>>>>>>>> test_fence_await verifies if a fence used to pipeline work
> >>>>>>>>>>>> signals correctly. The await-hang and nb-await-hang test
> >>>>>>>>>>>> cases inject a GPU hang, which causes an erroneous state,
> >>>>>>>>>>>> meaning the fence will be signaled without execution. The
> >>>>>>>>>>>> error causes an instant reset of the command streamer for
> >>>>>>>>>>>> the hanging workload. This revealed a problem with how we
> >>>>>>>>>>>> verify the fence state and results. The test assumes that
> >>>>>>>>>>>> the error notification happens after a hang is declared,
> >>>>>>>>>>>> which takes longer than submitting the next set of fences,
> >>>>>>>>>>>> making the test pass every time. With the immediate reset,
> >>>>>>>>>>>> this might not happen, so the assertion fails, as there are
> >>>>>>>>>>>> no active fences in the GPU hang case.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Move the check for an active fence to the path for the
> >>>>>>>>>>>> non-hanging workload, and verify results only in that
> >>>>>>>>>>>> scenario.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Signed-off-by: Chris Wilson
> >>>>>>>>>>>> Signed-off-by: Karolina Drobnik
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>  tests/i915/gem_exec_fence.c | 14 ++++++++------
> >>>>>>>>>>>>  1 file changed, 8 insertions(+), 6 deletions(-)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> diff --git a/tests/i915/gem_exec_fence.c b/tests/i915/gem_exec_fence.c
> >>>>>>>>>>>> index d46914c2..260aa82c 100644
> >>>>>>>>>>>> --- a/tests/i915/gem_exec_fence.c
> >>>>>>>>>>>> +++ b/tests/i915/gem_exec_fence.c
> >>>>>>>>>>>> @@ -350,18 +350,20 @@ static void test_fence_await(int fd, const intel_ctx_t *ctx,
> >>>>>> (...)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> -	if ((flags & HANG) == 0)
> >>>>>>>>>>>>  		igt_spin_end(spin);
> >>>>>>>>>>>> +	}
> >>>>>>>>>>>> 
> >>>>>>>>>>>>  	igt_waitchildren();
> >>>>>>>>>>>> 
> >>>>>>>>>>>>  	gem_set_domain(fd, scratch, I915_GEM_DOMAIN_GTT, 0);
> >>>>>>>>>>>> -	while (i--)
> >>>>>>>>>>>> +	igt_assert(!fence_busy(spin->out_fence));
> >>>>>>>>>>> 
> >>>>>>>>>>> We only check that the out fence of the presumably hanged spin
> >>>>>>>>>>> batch no longer blocks execution of store batches.
> >>>>>>>>>> 
> >>>>>>>>>> This check applies to all workloads; all of them should be done
> >>>>>>>>>> with their work at this point.
> >>>>>>>>> 
> >>>>>>>>> OK, but since that's the only explicit assert in the *-hang
> >>>>>>>>> processing path, does it tell us anything about fencing working
> >>>>>>>>> or not?
> >>>>>>>> 
> >>>>>>>> It says that we were given an active fence; we wait on it and hope
> >>>>>>>> it signals when an error is reported. Like I said, we can't check
> >>>>>>>> the results themselves, as they are meaningless with the reset. If
> >>>>>>>> we have an active fence at this point, that's bad, and the test
> >>>>>>>> should fail.
> >>>>>>>> 
> >>>>>>>>> I think it doesn't, and as long as I'm not wrong, I think such
> >>>>>>>>> scenarios hardly belong in gem_exec_fence.
> >>>>>>>> 
> >>>>>>>> Hm, I'm not sure if I follow, but this exact transition (from
> >>>>>>>> active -> (record error) -> signal) is one of the possible
> >>>>>>>> scenarios we wish to test.
> >>>>>>> 
> >>>>>>> OK, so we check if an active fence is signalled on error. But then,
> >>>>>>> what does 'active' mean here?
> >>>>>> 
> >>>>>> Active means that the fence was sent and we are waiting to hear back
> >>>>>> from the GPU. If you'd like to see what the state transitions of
> >>>>>> fences are, you can check out the Mainline Explicit Fencing series
> >>>>>> of blog posts/presentations.
> >>>>>> 
> >>>>>>> Do we consider a fence active as soon as it has been exported to
> >>>>>>> userspace? Or only after it has been imported back from userspace
> >>>>>>> by at least one consumer?
> >>>>>> 
> >>>>>> We assume them to be active as soon as the fence's fd is returned
> >>>>>> from the driver to userspace (we deal with out-fences here).
> >>>>>> 
> >>>>>>> Assuming the former (as I guess), what do we need the store batches
> >>>>>>> for in these now modified *await-hang scenarios? What extra value
> >>>>>>> do those scenarios provide compared to (nb-)?wait-hang?
> >>>>>> 
> >>>>>> Hm, I don't quite understand the question here -- are you asking why
> >>>>>> we check the store results? test_fence_await is reused by other
> >>>>>> subtests, not only the hanging cases, and we expect to see the
> >>>>>> stores.
> >>>>> 
> >>>>> The patch introduces two processing paths to test_fence_await(), one
> >>>>> of them for *-hang variants.
> >>>>> In that *-hang processing path, why do we still submit batches that
> >>>>> depend on the fence if we don't examine their execution and results,
> >>>>> only their completion?
> >>>> 
> >>>> Ah, OK, now I get it. Do you suggest that we should also add a check
> >>>> for the HANG flag around the igt_store_word block? But we still need
> >>>> to send something to test this case.
> >>> 
> >>> Why do we need to send something? What should that something look
> >>> like? IOW, what are we trying to exercise with this case? How is it
> >>> supposed to be different from the wait-hang scenario?
> >> 
> >> test_fence_await tests pipelining work (here, writing something to
> >> memory in parallel) and cross-checks its status with what's reported
> >> by the fence,
> > 
> > True for *await (no HANG) subtests, but in the *await-hang variants I
> > still can't recognize the checks in place as cross-checks, nor see what
> > value they add compared to wait-hang.
> 
> wait_hang and await_hang are different in the sense of the work they do.
> test_fence_busy is a test with a simple batch buffer (with no actual
> writes) that is _iteratively_ executed on different physical engines,
> whereas test_fence_await spawns work on different engines in _parallel_.
> It's a different scenario from the fence's perspective, as we have one
> fence on the spinner and pass it to other engines which (asynchronously)
> wait on it. The value added here is the check for one fence -> many
> engines (wait_hang has a 1:1 mapping).
> 
> > Does a check for invalidated work, no longer queued nor running, tell
> > us anything about the fence behavior, i.e., if and how it affected or
> > interacted with that work? I think it doesn't, and in that case I'm
> > not sure if we still need such limited scenarios.
> 
> I don't think I quite understand -- we check if we've sent a hanging
> workload, which is expected to result in an error; there's no additional
> check for the work itself (how would one check it?). As the error is
> raised from any of the engines (I'm talking about the await case), the
> fence is expected to signal, and this is what we check. The fence itself
> can't interact with that work.
> 
> > Please convince me that we do, and I'll be happy to push the series
> > with my S-o-b: added.
> 
> I have commit rights now, so I can push it myself,

OK, let's take a different approach then. Please feel free to push the
series yourself with the R-b:s collected so far, and I'll try to update
and restore the disabled checks in a follow-up patch.

Thanks,
Janusz

> but I want to address all the comments before moving on.
> 
> If you have further questions or my answers are not satisfactory (which
> is fair), you can reach out to the author. He knows all the bits and
> pieces of this test suite. I'm more of a messenger here.
> 
> All the best,
> Karolina
> 
> > Thanks,
> > Janusz
> 
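
For reference, the reworked test_fence_await() flow debated above can be
sketched roughly as follows. This is a minimal sketch, not the actual
patch: it assumes the usual includes of tests/i915/gem_exec_fence.c, the
igt_store_word() call is abbreviated (the real helper takes more
arguments), and the hang injection for the *-hang variants is omitted.
Only the split into a non-hanging path (fence-busy check plus stored-value
verification) and a hang path (fence-signal check only) follows the quoted
diff.

static void test_fence_await(int fd, const intel_ctx_t *ctx,
			     unsigned ring, unsigned flags)
{
	const struct intel_execution_engine2 *e;
	uint32_t scratch = gem_create(fd, 4096);
	igt_spin_t *spin;
	uint32_t *out;
	int i = 0;

	/*
	 * One spinner exports the out-fence all store batches will await.
	 * For the *-hang variants the real test also arms the spinner to
	 * hang; that part is left out of this sketch.
	 */
	spin = igt_spin_new(fd, .ctx = ctx, .engine = ring,
			    .flags = IGT_SPIN_FENCE_OUT);

	/*
	 * Pipeline one store per engine, each awaiting the same fence
	 * (the one fence -> many engines mapping discussed above).
	 */
	for_each_ctx_engine(fd, ctx, e) {
		igt_store_word(fd, ctx, e, spin->out_fence, scratch, i, i);
		i++;
	}

	if ((flags & HANG) == 0) {
		/* Valid workload: the stores must still be blocked... */
		igt_assert(fence_busy(spin->out_fence));
		/* ...until the spinner is explicitly released. */
		igt_spin_end(spin);
	}

	igt_waitchildren();

	/* Hang or no hang, the fence must have signalled by now. */
	gem_set_domain(fd, scratch, I915_GEM_DOMAIN_GTT, 0);
	igt_assert(!fence_busy(spin->out_fence));

	if ((flags & HANG) == 0) {
		/* Stored values are meaningful only without a GPU reset. */
		out = gem_mmap__wc(fd, scratch, 0, 4096, PROT_READ);
		while (i--)
			igt_assert_eq_u32(out[i], i);
		munmap(out, 4096);
	}

	igt_spin_free(fd, spin);
	gem_close(fd, scratch);
}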