* Intermittent errors when using amdgpu_job_submit_direct
@ 2019-07-09  4:53 Kuehling, Felix
       [not found] ` <885956af-be59-d218-f2e7-a0fc06042f21-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Kuehling, Felix @ 2019-07-09  4:53 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

I'm seeing some weird intermittent bugs (vm faults, hangs, etc) when 
trying to use amdgpu_job_submit_direct. I'm wondering if there is a 
possibility of a race condition, when a submit_direct and a GPU 
scheduler thread try to submit to the same ring at the same time. I 
didn't see any locking to allow multiple threads safely submitting to 
the same ring.

Am I missing something?
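
To make the suspected race concrete, here is a minimal userspace sketch that replays one bad interleaving by hand. It is a toy model and not amdgpu code: `toy_ring`, `toy_ring_write`, `toy_replay` and the two-dword "packet" layout are hypothetical names invented for this illustration. The only assumption is that each producer appends dwords at a shared write pointer with no lock held across the whole packet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_DWORDS 16u /* hypothetical ring size, power of two */

/* Tag each dword with the producer that wrote it (illustration only). */
#define TOY_DW(owner, idx) (((uint32_t)(owner) << 8) | (uint32_t)(idx))
#define TOY_OWNER(dw)      ((dw) >> 8)

/* Toy model of a command ring; not the real amdgpu ring structure. */
struct toy_ring {
	uint32_t buf[RING_DWORDS];
	uint32_t wptr; /* shared write pointer, in dwords */
};

/* Append one dword at the shared write pointer, with no locking. */
static void toy_ring_write(struct toy_ring *ring, uint32_t v)
{
	ring->buf[ring->wptr & (RING_DWORDS - 1)] = v;
	ring->wptr++;
}

/* Return 1 if every two-dword packet in [0, wptr) has a single owner. */
static int toy_ring_packets_intact(const struct toy_ring *ring)
{
	uint32_t i;

	for (i = 0; i + 1 < ring->wptr; i += 2)
		if (TOY_OWNER(ring->buf[i]) != TOY_OWNER(ring->buf[i + 1]))
			return 0;
	return 1;
}

/*
 * Replay two producers, A (say, a scheduler thread) and B (say, a direct
 * submission), each writing one two-dword packet. With interleaved == 0
 * the packets are written back to back; with interleaved != 0 the writes
 * alternate, as they could if both ran at once without a lock.
 */
static int toy_replay(struct toy_ring *ring, int interleaved)
{
	memset(ring, 0, sizeof(*ring));
	toy_ring_write(ring, TOY_DW('A', 0));
	if (interleaved) {
		toy_ring_write(ring, TOY_DW('B', 0));
		toy_ring_write(ring, TOY_DW('A', 1));
	} else {
		toy_ring_write(ring, TOY_DW('A', 1));
		toy_ring_write(ring, TOY_DW('B', 0));
	}
	toy_ring_write(ring, TOY_DW('B', 1));
	return toy_ring_packets_intact(ring);
}
```

Calling `toy_replay(&ring, 0)` leaves two intact packets, while `toy_replay(&ring, 1)` leaves the same four dwords in an order where neither packet is contiguous. A command processor parsing such a stream would see garbage, which would be consistent with intermittent faults and hangs.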

Thanks,
   Felix

-- 
F e l i x   K u e h l i n g
PMTS Software Development Engineer | Linux Compute Kernel
1 Commerce Valley Dr. East, Markham, ON L3T 7X6 Canada
(O) +1(289)695-1597

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: Intermittent errors when using amdgpu_job_submit_direct
       [not found] ` <885956af-be59-d218-f2e7-a0fc06042f21-5C7GfCeVMHo@public.gmane.org>
@ 2019-07-09 12:58   ` Chunming Zhou
       [not found]     ` <affc1656-4696-846e-baca-48331aef3043-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Chunming Zhou @ 2019-07-09 12:58 UTC
  To: Kuehling, Felix, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

I raised this when Christian did the page fault work. In that patch,
amdgpu_job_submit_direct uses an exclusive page fault ring.

But if you use amdgpu_job_submit_direct on general rings occupied by
the scheduler, I guess various bugs will happen.

-David

On 2019/7/9 12:53, Kuehling, Felix wrote:
> I'm seeing some weird intermittent bugs (vm faults, hangs, etc) when
> trying to use amdgpu_job_submit_direct. I'm wondering if there is a
> possibility of a race condition, when a submit_direct and a GPU
> scheduler thread try to submit to the same ring at the same time. I
> didn't see any locking to allow multiple threads safely submitting to
> the same ring.
>
> Am I missing something?
>
> Thanks,
>     Felix
>

* Re: Intermittent errors when using amdgpu_job_submit_direct
       [not found]     ` <affc1656-4696-846e-baca-48331aef3043-5C7GfCeVMHo@public.gmane.org>
@ 2019-07-09 19:26       ` Kuehling, Felix
       [not found]         ` <bcaf471b-bc25-02c0-4547-b756bbf42bf6-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Kuehling, Felix @ 2019-07-09 19:26 UTC
  To: Zhou, David(ChunMing), amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 2019-07-09 8:58 a.m., Zhou, David(ChunMing) wrote:
> I raised this when Christian did the page fault work. In that patch,
> amdgpu_job_submit_direct uses an exclusive page fault ring.
>
> But if you use amdgpu_job_submit_direct on general rings occupied by
> the scheduler, I guess various bugs will happen.

The problem is, even the paging ring is used by the scheduler. There are 
several places where buffer operations are submitted to the paging ring 
through the scheduler. That makes any use of the paging ring through 
direct submission problematic.

Even ignoring the scheduler, if it's possible that multiple threads 
submit to the paging ring, we'll need locking to ensure that the 
contents of the ring remain consistent. IIRC, the rings used to have 
locking before we had a GPU scheduler. For comparison, see 
radeon_ring.c, which still has locking. With the GPU scheduler, the 
rings became single-producer queues that no longer needed locking. But 
with direct submission that is no longer true. I think a good place to 
do that locking now would be in amdgpu_ib_schedule.
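
As a rough illustration of what such locking buys, here is a userspace sketch using pthreads. It is not the amdgpu implementation and makes no claim about which kernel lock type would be appropriate; `locked_ring`, `locked_ring_submit` and the constants are hypothetical names for this example. The point it shows is that the lock covers the whole packet, so several producers can share one ring without interleaving their dwords.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define RING_DWORDS       8192u /* hypothetical size; large enough that this demo never wraps */
#define PKT_DWORDS        2u
#define NUM_PRODUCERS     4u
#define PKTS_PER_PRODUCER 1000u

/* Toy ring with a lock that serializes whole submissions. */
struct locked_ring {
	pthread_mutex_t lock;
	uint32_t buf[RING_DWORDS];
	uint32_t wptr;
};

static struct locked_ring g_ring = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Write one packet while holding the lock, so its dwords stay contiguous. */
static int locked_ring_submit(struct locked_ring *ring, uint32_t owner)
{
	uint32_t i;
	int ret = 0;

	pthread_mutex_lock(&ring->lock);
	if (ring->wptr + PKT_DWORDS > RING_DWORDS) {
		ret = -1; /* toy model: no wrap-around, no waiting for space */
	} else {
		for (i = 0; i < PKT_DWORDS; i++)
			ring->buf[ring->wptr++] = (owner << 8) | i;
	}
	pthread_mutex_unlock(&ring->lock);
	return ret;
}

static void *producer(void *arg)
{
	uint32_t owner = (uint32_t)(uintptr_t)arg;
	uint32_t n;

	for (n = 0; n < PKTS_PER_PRODUCER; n++)
		locked_ring_submit(&g_ring, owner);
	return NULL;
}

/* Run all producers concurrently; return 1 if every packet is intact. */
static int run_producers_and_check(void)
{
	pthread_t t[NUM_PRODUCERS];
	uint32_t i, started = 0;
	int ok = 1;

	g_ring.wptr = 0;
	for (i = 0; i < NUM_PRODUCERS; i++) {
		if (pthread_create(&t[i], NULL, producer,
				   (void *)(uintptr_t)(i + 1)) != 0) {
			ok = 0;
			break;
		}
		started++;
	}
	for (i = 0; i < started; i++)
		pthread_join(t[i], NULL);

	if (!ok || g_ring.wptr != NUM_PRODUCERS * PKTS_PER_PRODUCER * PKT_DWORDS)
		return 0;

	/* Each packet must be (owner, 0) followed by (owner, 1). */
	for (i = 0; i < g_ring.wptr; i += PKT_DWORDS)
		if ((g_ring.buf[i] >> 8) != (g_ring.buf[i + 1] >> 8) ||
		    (g_ring.buf[i] & 0xff) != 0 ||
		    (g_ring.buf[i + 1] & 0xff) != 1)
			return 0;
	return 1;
}
```

Calling `run_producers_and_check()` starts four producers that each submit 1000 packets. Because the mutex is held from the first dword of a packet to the last, the check for contiguous packets passes regardless of how the threads are scheduled. Locking each individual dword write instead would not be enough, since packets from different producers could still interleave.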

Regards,
   Felix



* Re: Intermittent errors when using amdgpu_job_submit_direct
       [not found]         ` <bcaf471b-bc25-02c0-4547-b756bbf42bf6-5C7GfCeVMHo@public.gmane.org>
@ 2019-07-10 13:32           ` Chunming Zhou
  0 siblings, 0 replies; 4+ messages in thread
From: Chunming Zhou @ 2019-07-10 13:32 UTC
  To: Kuehling, Felix, Zhou, David(ChunMing),
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


On 2019/7/10 3:26, Kuehling, Felix wrote:
> On 2019-07-09 8:58 a.m., Zhou, David(ChunMing) wrote:
>> I raised this when Christian did the page fault work. In that patch,
>> amdgpu_job_submit_direct uses an exclusive page fault ring.
>>
>> But if you use amdgpu_job_submit_direct on general rings occupied by
>> the scheduler, I guess various bugs will happen.
> The problem is, even the paging ring is used by the scheduler. There are
> several places where buffer operations are submitted to the paging ring
> through the scheduler. That makes any use of the paging ring through
> direct submission problematic.
>
> Even ignoring the scheduler, if it's possible that multiple threads
> submit to the paging ring, we'll need locking to ensure that the
> contents of the ring remain consistent. IIRC, the rings used to have
> locking before we had a GPU scheduler. For comparison, see
> radeon_ring.c, which still has locking. With the GPU scheduler, the
> rings became single-producer queues that no longer needed locking. But
> with direct submission that is no longer true. I think a good place to
> do that locking now would be in amdgpu_ib_schedule.

Yes, that is exactly why we removed the ring lock at that time.

You can add it back if submit_direct has to co-exist with the scheduler.

-David

