* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
@ 2018-03-19 12:43 ` Chris Wilson
2018-03-19 14:14 ` Lis, Tomasz
2018-03-30 17:29 ` [PATCH " Tomasz Lis
` (6 subsequent siblings)
7 siblings, 1 reply; 88+ messages in thread
From: Chris Wilson @ 2018-03-19 12:43 UTC (permalink / raw)
To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski
Quoting Tomasz Lis (2018-03-19 12:37:35)
> The patch adds a parameter to control the data port coherency functionality
> on a per-exec call basis. When data port coherency flag value is different
> than what it was in previous call for the context, a command to switch data
> port coherency state is added before the buffer to be executed.
So this is part of the context? Why do it at exec level? If exec level
is desired, why not whitelist it?
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
2018-03-19 12:43 ` Chris Wilson
@ 2018-03-19 14:14 ` Lis, Tomasz
2018-03-19 14:26 ` Chris Wilson
2018-03-20 18:43 ` Oscar Mateo
0 siblings, 2 replies; 88+ messages in thread
From: Lis, Tomasz @ 2018-03-19 14:14 UTC (permalink / raw)
To: Chris Wilson, intel-gfx; +Cc: bartosz.dunajski
On 2018-03-19 13:43, Chris Wilson wrote:
> Quoting Tomasz Lis (2018-03-19 12:37:35)
>> The patch adds a parameter to control the data port coherency functionality
>> on a per-exec call basis. When data port coherency flag value is different
>> than what it was in previous call for the context, a command to switch data
>> port coherency state is added before the buffer to be executed.
> So this is part of the context? Why do it at exec level?
It is part of the context, stored within the HDC chicken bit register.
The exec level was requested by the OCL team, due to concerns about the
performance cost of context setparam calls.
> If exec level
> is desired, why not whitelist it?
> -Chris
If we have no issue in whitelisting the register, I'm sure OCL will
agree to that.
I assumed whitelisting would be unacceptable because of security
concerns with some of the options.
The register also changes its position and content between gens, which
makes whitelisting hard to manage.
The main purpose of chicken bit registers, in general, is to allow working
around hardware features which could be buggy or could have an
unintended influence on the platform.
The data port coherency functionality landed there for the same reasons;
then it twisted itself in a way that we now need user space to switch it.
Is it really ok to whitelist chicken bit registers?
-Tomasz
* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
2018-03-19 14:14 ` Lis, Tomasz
@ 2018-03-19 14:26 ` Chris Wilson
2018-03-20 17:23 ` Lis, Tomasz
2018-03-20 18:43 ` Oscar Mateo
1 sibling, 1 reply; 88+ messages in thread
From: Chris Wilson @ 2018-03-19 14:26 UTC (permalink / raw)
To: Lis, Tomasz, intel-gfx; +Cc: bartosz.dunajski
Quoting Lis, Tomasz (2018-03-19 14:14:19)
>
>
> On 2018-03-19 13:43, Chris Wilson wrote:
> > Quoting Tomasz Lis (2018-03-19 12:37:35)
> >> The patch adds a parameter to control the data port coherency functionality
> >> on a per-exec call basis. When data port coherency flag value is different
> >> than what it was in previous call for the context, a command to switch data
> >> port coherency state is added before the buffer to be executed.
> > So this is part of the context? Why do it at exec level?
>
> It is part of the context, stored within HDC chicken bit register.
> The exec level was requested by the OCL team, due to concerns about
> performance cost of context setparam calls.
What? Oh dear, oh dear, thrice oh dear. The context setparam would look
like:
if (arg != context->value) {
	rq = request_alloc(context, RCS);
	cs = ring_begin(rq, 4);
	*cs++ = MI_LRI;
	*cs++ = reg;
	*cs++ = magic;
	*cs++ = MI_NOOP;
	request_add(rq);
	context->value = arg;
}
The argument is whether stuffing it into a crowded, v.frequently
executed execbuf is better than an irregular setparam. If they want to
flip it on every batch, use execbuf. If it's going to be very
infrequent, setparam.
That discussion must be part of the rationale in the commitlog.
Otoh, execbuf3 would accept it as a command packet. Hmm.
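The sketch above can be modeled in plain, runnable C to show the cost trade-off being argued: a switch is emitted only when the requested value differs from the last one programmed. All names here (ctx_state, submit_batch) are illustrative stand-ins, not i915 code; it shows why a per-batch ping-pong pays one LRI per submission while a steady workload pays one LRI total.

```c
#include <stdbool.h>

/* Illustrative model (not i915 code) of per-context software state
 * mirroring the last coherency value programmed into the register. */
struct ctx_state {
	bool coherent;    /* last value emitted to hardware */
	int lri_emitted;  /* number of MI_LRI switches added so far */
};

/* The "emit a switch only when the flag changed" rule from the patch
 * description; returns true when an LRI would precede this batch. */
static bool submit_batch(struct ctx_state *ctx, bool want_coherent)
{
	if (want_coherent == ctx->coherent)
		return false;
	ctx->coherent = want_coherent;
	ctx->lri_emitted++;
	return true;
}

/* kern_master/kern_worker ping-pong: the flag alternates, so every
 * submission needs a register write. Context starts non-coherent. */
static int pingpong_switch_count(int n)
{
	struct ctx_state ctx = { .coherent = false };
	for (int i = 0; i < n; i++)
		submit_batch(&ctx, (i % 2) == 0);
	return ctx.lri_emitted;
}

/* Steady workload: the same flag every time costs one LRI in total. */
static int steady_switch_count(int n)
{
	struct ctx_state ctx = { .coherent = false };
	for (int i = 0; i < n; i++)
		submit_batch(&ctx, true);
	return ctx.lri_emitted;
}
```

Either interface (setparam or exec flag) reduces to this logic; the difference is only where the check happens and how many ioctls it takes.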
> > If exec level
> > is desired, why not whitelist it?
>
> If we have no issue in whitelisting the register, I'm sure OCL will
> agree to that.
> I assumed the whitelisting will be unacceptable because of security
> concerns with some options.
> The register also changes its position and content between gens, which
> makes whitelisting hard to manage.
>
> Main purpose of chicken bit registers, in general, is to allow work
> around for hardware features which could be buggy or could have
> unintended influence on the platform.
> The data port coherency functionality landed there for the same reasons;
> then it twisted itself in a way that we now need user space to switch it.
> Is it really ok to whitelist chicken bit registers?
It all depends on whether it breaks segregation. If the only users
affected are themselves, fine. Otherwise, no.
-Chris
* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
2018-03-19 14:26 ` Chris Wilson
@ 2018-03-20 17:23 ` Lis, Tomasz
2018-05-04 9:24 ` Joonas Lahtinen
0 siblings, 1 reply; 88+ messages in thread
From: Lis, Tomasz @ 2018-03-20 17:23 UTC (permalink / raw)
To: Chris Wilson, intel-gfx; +Cc: bartosz.dunajski
On 2018-03-19 15:26, Chris Wilson wrote:
> Quoting Lis, Tomasz (2018-03-19 14:14:19)
>>
>> On 2018-03-19 13:43, Chris Wilson wrote:
>>> Quoting Tomasz Lis (2018-03-19 12:37:35)
>>>> The patch adds a parameter to control the data port coherency functionality
>>>> on a per-exec call basis. When data port coherency flag value is different
>>>> than what it was in previous call for the context, a command to switch data
>>>> port coherency state is added before the buffer to be executed.
>>> So this is part of the context? Why do it at exec level?
>> It is part of the context, stored within HDC chicken bit register.
>> The exec level was requested by the OCL team, due to concerns about
>> performance cost of context setparam calls.
> What? Oh dear, oh dear, thrice oh dear. The context setparam would look
> like:
>
> if (arg != context->value) {
> 	rq = request_alloc(context, RCS);
> 	cs = ring_begin(rq, 4);
> 	*cs++ = MI_LRI;
> 	*cs++ = reg;
> 	*cs++ = magic;
> 	*cs++ = MI_NOOP;
> 	request_add(rq);
> 	context->value = arg;
> }
>
> The argument is whether stuffing it into a crowded, v.frequently
> executed execbuf is better than an irregular setparam. If they want to
> flip it on every batch, use execbuf. If it's going to be very
> infrequent, setparam.
Implementing the data port coherency switch as context setparam would
not be a problem, I agree.
But this is not a solution OCL is willing to accept. Any additional
IOCTL call is a concern for the OCL developers.
For more explanation on switch frequency - please look at the cover
letter I provided; here's the related part of it:
(note: the data port coherency is called fine grain coherency within UMD)
3. Will coherency switch be used frequently?
There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle coherency switch on or off per each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
> That discussion must be part of the rationale in the commitlog.
Will add.
Should I place the whole text from cover letter within the commit comment?
> Otoh, execbuf3 would accept it as a command packet. Hmm.
I know we have execbuf2, but execbuf3? Are you proposing to add
something like that?
>>> If exec level
>>> is desired, why not whitelist it?
>> If we have no issue in whitelisting the register, I'm sure OCL will
>> agree to that.
>> I assumed the whitelisting will be unacceptable because of security
>> concerns with some options.
>> The register also changes its position and content between gens, which
>> makes whitelisting hard to manage.
>>
>> Main purpose of chicken bit registers, in general, is to allow work
>> around for hardware features which could be buggy or could have
>> unintended influence on the platform.
>> The data port coherency functionality landed there for the same reasons;
>> then it twisted itself in a way that we now need user space to switch it.
>> Is it really ok to whitelist chicken bit registers?
> It all depends on whether it breaks segregation. If the only users
> affected are themselves, fine. Otherwise, no.
> -Chris
Chicken bit registers are definitely not intended to be safe for use.
While the meaning of the bits within HDC_CHICKEN0 changes between gens, I
doubt any of these registers *can't* be used to cause a GPU hang.
-Tomasz
* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
2018-03-20 17:23 ` Lis, Tomasz
@ 2018-05-04 9:24 ` Joonas Lahtinen
0 siblings, 0 replies; 88+ messages in thread
From: Joonas Lahtinen @ 2018-05-04 9:24 UTC (permalink / raw)
To: Lis, Tomasz, Chris Wilson, intel-gfx; +Cc: bartosz.dunajski
Quoting Lis, Tomasz (2018-03-20 19:23:03)
>
>
> On 2018-03-19 15:26, Chris Wilson wrote:
>
> Quoting Lis, Tomasz (2018-03-19 14:14:19)
>
>
> On 2018-03-19 13:43, Chris Wilson wrote:
>
> Quoting Tomasz Lis (2018-03-19 12:37:35)
>
> The patch adds a parameter to control the data port coherency functionality
> on a per-exec call basis. When data port coherency flag value is different
> than what it was in previous call for the context, a command to switch data
> port coherency state is added before the buffer to be executed.
>
> So this is part of the context? Why do it at exec level?
>
> It is part of the context, stored within HDC chicken bit register.
> The exec level was requested by the OCL team, due to concerns about
> performance cost of context setparam calls.
>
> What? Oh dear, oh dear, thrice oh dear. The context setparam would look
> like:
>
> if (arg != context->value) {
> 	rq = request_alloc(context, RCS);
> 	cs = ring_begin(rq, 4);
> 	*cs++ = MI_LRI;
> 	*cs++ = reg;
> 	*cs++ = magic;
> 	*cs++ = MI_NOOP;
> 	request_add(rq);
> 	context->value = arg;
> }
>
> The argument is whether stuffing it into a crowded, v.frequently
> executed execbuf is better than an irregular setparam. If they want to
> flip it on every batch, use execbuf. If it's going to be very
> infrequent, setparam.
>
> Implementing the data port coherency switch as context setparam would not be a
> problem, I agree.
> But this is not a solution OCL is willing to accept. Any additional IOCTL call
> is a concern for the OCL developers.
Being part of the hardware context is a good indication that the GEM
context is the right place for the bit. Stuffing more into execbuf for a
one-usecase-one-platform scenario doesn't sound very future-looking in
terms of overall driver development.
I would truly imagine that the IOCTL execution time should not be
meaningful compared to compute kernel execution times. If they are
already having a large amount of IOCTL calls between each batch, I guess
that is something to be discussed separately.
Regards, Joonas
>
> For more explanation on switch frequency - please look at the cover letter I
> provided; here's the related part of it:
> (note: the data port coherency is called fine grain coherency within UMD)
>
> 3. Will coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>
> That discussion must be part of the rationale in the commitlog.
>
> Will add.
> Should I place the whole text from cover letter within the commit comment?
>
> Otoh, execbuf3 would accept it as a command packet. Hmm.
>
> I know we have execbuf2, but execbuf3? Are you proposing to add something like
> that?
>
> If exec level
> is desired, why not whitelist it?
>
> If we have no issue in whitelisting the register, I'm sure OCL will
> agree to that.
> I assumed the whitelisting will be unacceptable because of security
> concerns with some options.
> The register also changes its position and content between gens, which
> makes whitelisting hard to manage.
>
> Main purpose of chicken bit registers, in general, is to allow work
> around for hardware features which could be buggy or could have
> unintended influence on the platform.
> The data port coherency functionality landed there for the same reasons;
> then it twisted itself in a way that we now need user space to switch it.
> Is it really ok to whitelist chicken bit registers?
>
> It all depends on whether it breaks segregation. If the only users
> affected are themselves, fine. Otherwise, no.
> -Chris
>
> Chicken Bit registers are definitely not planned as safe for use. While meaning
> of bits within HDC_CHICKEN0 change between gens, I doubt any of the registers
> *can't* be used to cause GPU hung.
> -Tomasz
>
* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
2018-03-19 14:14 ` Lis, Tomasz
2018-03-19 14:26 ` Chris Wilson
@ 2018-03-20 18:43 ` Oscar Mateo
2018-03-21 10:16 ` Chris Wilson
1 sibling, 1 reply; 88+ messages in thread
From: Oscar Mateo @ 2018-03-20 18:43 UTC (permalink / raw)
To: Lis, Tomasz, Chris Wilson, intel-gfx; +Cc: bartosz.dunajski
On 3/19/2018 7:14 AM, Lis, Tomasz wrote:
>
>
> On 2018-03-19 13:43, Chris Wilson wrote:
>> Quoting Tomasz Lis (2018-03-19 12:37:35)
>>> The patch adds a parameter to control the data port coherency
>>> functionality
>>> on a per-exec call basis. When data port coherency flag value is
>>> different
>>> than what it was in previous call for the context, a command to
>>> switch data
>>> port coherency state is added before the buffer to be executed.
>> So this is part of the context? Why do it at exec level?
>
> It is part of the context, stored within HDC chicken bit register.
> The exec level was requested by the OCL team, due to concerns about
> performance cost of context setparam calls.
>
>> If exec level
>> is desired, why not whitelist it?
>> -Chris
>
> If we have no issue in whitelisting the register, I'm sure OCL will
> agree to that.
> I assumed the whitelisting will be unacceptable because of security
> concerns with some options.
> The register also changes its position and content between gens, which
> makes whitelisting hard to manage.
>
I think a security analysis of this register was already done, and the
result was that it contains some other bits that could be dangerous. In
CNL those bits were moved out of the way and the HDC_CHICKEN0 register
can be whitelisted (WaAllowUMDToControlCoherency). In ICL the register
should already be non-privileged.
> Main purpose of chicken bit registers, in general, is to allow work
> around for hardware features which could be buggy or could have
> unintended influence on the platform.
> The data port coherency functionality landed there for the same
> reasons; then it twisted itself in a way that we now need user space
> to switch it.
> Is it really ok to whitelist chicken bit registers?
> -Tomasz
>
* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
2018-03-20 18:43 ` Oscar Mateo
@ 2018-03-21 10:16 ` Chris Wilson
2018-03-21 19:42 ` Oscar Mateo
0 siblings, 1 reply; 88+ messages in thread
From: Chris Wilson @ 2018-03-21 10:16 UTC (permalink / raw)
To: Oscar Mateo, Lis, Tomasz, intel-gfx; +Cc: bartosz.dunajski
Quoting Oscar Mateo (2018-03-20 18:43:45)
>
>
> On 3/19/2018 7:14 AM, Lis, Tomasz wrote:
> >
> >
> > On 2018-03-19 13:43, Chris Wilson wrote:
> >> Quoting Tomasz Lis (2018-03-19 12:37:35)
> >>> The patch adds a parameter to control the data port coherency
> >>> functionality
> >>> on a per-exec call basis. When data port coherency flag value is
> >>> different
> >>> than what it was in previous call for the context, a command to
> >>> switch data
> >>> port coherency state is added before the buffer to be executed.
> >> So this is part of the context? Why do it at exec level?
> >
> > It is part of the context, stored within HDC chicken bit register.
> > The exec level was requested by the OCL team, due to concerns about
> > performance cost of context setparam calls.
> >
> >> If exec level
> >> is desired, why not whitelist it?
> >> -Chris
> >
> > If we have no issue in whitelisting the register, I'm sure OCL will
> > agree to that.
> > I assumed the whitelisting will be unacceptable because of security
> > concerns with some options.
> > The register also changes its position and content between gens, which
> > makes whitelisting hard to manage.
> >
>
> I think a security analysis of this register was already done, and the
> result was that it contains some other bits that could be dangerous. In
> CNL those bits were moved out of the way and the HDC_CHICKEN0 register
> can be whitelisted (WaAllowUMDToControlCoherency). In ICL the register
> should already be non-privileged.
The previous alternative to whitelisting was running through a command
parser for validation. That's a very general mechanism suitable for a
wide range of sins.
-Chris
* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
2018-03-21 10:16 ` Chris Wilson
@ 2018-03-21 19:42 ` Oscar Mateo
2018-03-27 17:41 ` Lis, Tomasz
0 siblings, 1 reply; 88+ messages in thread
From: Oscar Mateo @ 2018-03-21 19:42 UTC (permalink / raw)
To: Chris Wilson, Lis, Tomasz, intel-gfx; +Cc: bartosz.dunajski
On 3/21/2018 3:16 AM, Chris Wilson wrote:
> Quoting Oscar Mateo (2018-03-20 18:43:45)
>>
>> On 3/19/2018 7:14 AM, Lis, Tomasz wrote:
>>>
>>> On 2018-03-19 13:43, Chris Wilson wrote:
>>>> Quoting Tomasz Lis (2018-03-19 12:37:35)
>>>>> The patch adds a parameter to control the data port coherency
>>>>> functionality
>>>>> on a per-exec call basis. When data port coherency flag value is
>>>>> different
>>>>> than what it was in previous call for the context, a command to
>>>>> switch data
>>>>> port coherency state is added before the buffer to be executed.
>>>> So this is part of the context? Why do it at exec level?
>>> It is part of the context, stored within HDC chicken bit register.
>>> The exec level was requested by the OCL team, due to concerns about
>>> performance cost of context setparam calls.
>>>
>>>> If exec level
>>>> is desired, why not whitelist it?
>>>> -Chris
>>> If we have no issue in whitelisting the register, I'm sure OCL will
>>> agree to that.
>>> I assumed the whitelisting will be unacceptable because of security
>>> concerns with some options.
>>> The register also changes its position and content between gens, which
>>> makes whitelisting hard to manage.
>>>
>> I think a security analysis of this register was already done, and the
>> result was that it contains some other bits that could be dangerous. In
>> CNL those bits were moved out of the way and the HDC_CHICKEN0 register
>> can be whitelisted (WaAllowUMDToControlCoherency). In ICL the register
>> should already be non-privileged.
> The previous alternative to whitelisting was running through a command
> parser for validation. That's a very general mechanism suitable for a
> wide range of sins.
> -Chris
Are you suggesting that we enable the cmd parser for every Gen < CNL for
this particular usage only? :P
* Re: [RFC v1] drm/i915: Add Exec param to control data port coherency.
2018-03-21 19:42 ` Oscar Mateo
@ 2018-03-27 17:41 ` Lis, Tomasz
0 siblings, 0 replies; 88+ messages in thread
From: Lis, Tomasz @ 2018-03-27 17:41 UTC (permalink / raw)
To: Oscar Mateo, Chris Wilson, intel-gfx; +Cc: bartosz.dunajski
On 2018-03-21 20:42, Oscar Mateo wrote:
>
>
> On 3/21/2018 3:16 AM, Chris Wilson wrote:
>> Quoting Oscar Mateo (2018-03-20 18:43:45)
>>>
>>> On 3/19/2018 7:14 AM, Lis, Tomasz wrote:
>>>>
>>>> On 2018-03-19 13:43, Chris Wilson wrote:
>>>>> Quoting Tomasz Lis (2018-03-19 12:37:35)
>>>>>> The patch adds a parameter to control the data port coherency
>>>>>> functionality
>>>>>> on a per-exec call basis. When data port coherency flag value is
>>>>>> different
>>>>>> than what it was in previous call for the context, a command to
>>>>>> switch data
>>>>>> port coherency state is added before the buffer to be executed.
>>>>> So this is part of the context? Why do it at exec level?
>>>> It is part of the context, stored within HDC chicken bit register.
>>>> The exec level was requested by the OCL team, due to concerns about
>>>> performance cost of context setparam calls.
>>>>
>>>>> If exec level
>>>>> is desired, why not whitelist it?
>>>>> -Chris
>>>> If we have no issue in whitelisting the register, I'm sure OCL will
>>>> agree to that.
>>>> I assumed the whitelisting will be unacceptable because of security
>>>> concerns with some options.
>>>> The register also changes its position and content between gens, which
>>>> makes whitelisting hard to manage.
>>>>
>>> I think a security analysis of this register was already done, and the
>>> result was that it contains some other bits that could be dangerous. In
>>> CNL those bits were moved out of the way and the HDC_CHICKEN0 register
>>> can be whitelisted (WaAllowUMDToControlCoherency). In ICL the register
>>> should already be non-privileged.
>> The previous alternative to whitelisting was running through a command
>> parser for validation. That's a very general mechanism suitable for a
>> wide range of sins.
>> -Chris
>
> Are you suggesting that we enable the cmd parser for every Gen < CNL
> for this particular usage only? :P
>
It is a solution that would allow us to do what we want without any
additions to the module interface.
It may be worth considering if we think the coherency setting will be
temporary and removed in future gens, as we wouldn't want to keep
obsolete flags.
I think the setting will stay with us, as it is needed to support the
CL_MEM_SVM_FINE_GRAIN_BUFFER flag from the OpenCL spec.
Keeping coherency on will always cost performance, so we will likely always
have a hardware setting to switch it.
But the bspec says coherency override control will be removed in future
projects...
* [PATCH v1] drm/i915: Add Exec param to control data port coherency.
2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
2018-03-19 12:43 ` Chris Wilson
@ 2018-03-30 17:29 ` Tomasz Lis
2018-03-31 19:07 ` kbuild test robot
2018-04-11 15:46 ` [PATCH v2] " Tomasz Lis
` (5 subsequent siblings)
7 siblings, 1 reply; 88+ messages in thread
From: Tomasz Lis @ 2018-03-30 17:29 UTC (permalink / raw)
To: intel-gfx; +Cc: bartosz.dunajski
The patch adds a parameter to control the data port coherency functionality
on a per-exec call basis. When the data port coherency flag value differs
from what it was in the previous call for the context, a command to switch
the data port coherency state is added before the buffer to be executed.
Rationale:
The OpenCL driver developers requested functionality to control cache
coherency at the data port level. Keeping coherency at that level is
disabled by default due to its performance costs. The OpenCL driver plans
to enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:
1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?
Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of coherency enable/disable
switch is to remove overhead of memory coherency when memory coherency is not
needed.
2. Why do we need a global coherency switch?
In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
________________
| NODE1 |
| uint64_t data |
+----------------|
| NODE* | NODE*|
+--------+-------+
/ \
________________/ \________________
| NODE2 | | NODE3 |
| uint64_t data | | uint64_t data |
+----------------| +----------------|
| NODE* | NODE*| | NODE* | NODE*|
+--------+-------+ +--------+-------+
Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as the example with a tree-like data
structure), the OCL compiler is not able to determine the origin of memory
pointed to by an arbitrary pointer - i.e. it is not able to track a given
pointer back to a specific allocation. As such, it's not able to decide
whether coherency is needed for a specific pointer (or for a specific I/O
instruction). As a result, the compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to workaround this (e.g. based on allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and alternative method is needed.
Such an alternative solution is to have a global coherency switch that
allows disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)
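The pointer-chasing problem described above can be illustrated with a small host-side sketch (plain C, illustrative only, not driver code): tree_sum() walks raw pointers, and nothing at the call site tells a compiler which allocation - fine-grain SVM or not - each pointer targets, which is why every stateless send must be encoded coherent.

```c
#include <stdint.h>
#include <stdlib.h>

/* The tree node from the diagram above: a payload plus two pointers
 * that may point into completely different allocations. */
struct node {
	uint64_t data;
	struct node *left, *right;
};

/* Walk the tree by chasing raw pointers. The traversal has no way to
 * tell which allocation each pointer came from, so per-access
 * coherency decisions are impossible at compile time. */
static uint64_t tree_sum(const struct node *n)
{
	if (!n)
		return 0;
	return n->data + tree_sum(n->left) + tree_sum(n->right);
}

/* Build NODE1..NODE3 with NODE3 in a separate allocation, mirroring
 * the diagram; returns the sum of all payloads. */
static uint64_t demo_sum(void)
{
	struct node *a = calloc(2, sizeof(*a)); /* NODE1 and NODE2 */
	struct node *b = calloc(1, sizeof(*b)); /* NODE3, separate allocation */
	uint64_t s;

	a[0].data = 1;
	a[1].data = 2;
	b->data = 3;
	a[0].left = &a[1];
	a[0].right = b;
	s = tree_sum(&a[0]);
	free(a);
	free(b);
	return s;
}
```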
3. Will coherency switch be used frequently?
There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle coherency switch on or off per each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
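From the UMD side, the toggling above amounts to flipping one bit in the execbuffer flags per submission. A minimal sketch, with a hypothetical flag name and bit position (the real name and value are defined by the uapi change in this patch and may differ):

```c
#include <stdint.h>

/* Hypothetical flag bit for illustration; the real definition lives
 * in the i915_drm.h change of this patch. */
#define I915_EXEC_DATA_PORT_COHERENT (1ull << 19)

/* Stand-in for filling the flags word of drm_i915_gem_execbuffer2:
 * kern_master touches fine-grain SVM so it asks for coherency;
 * kern_worker does not and should not pay for it. */
static uint64_t exec_flags_for(int is_kern_master, uint64_t base_flags)
{
	if (is_kern_master)
		return base_flags | I915_EXEC_DATA_PORT_COHERENT;
	return base_flags & ~I915_EXEC_DATA_PORT_COHERENT;
}
```

With this shape, the ping-pong loop needs no extra ioctl between submissions; the kernel emits the register switch only when the flag actually changes.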
4. Why the execlist flag approach was chosen?
There are two other ways of providing the functionality to UMDs, besides the
execlist flag:
a) Chicken bit register whitelisting.
This approach would allow adding the functionality without any change to
the KMD's interface. Also, it has been determined that whitelisting is
safe for gen10 and gen11. The issue is with gen9, where hardware
whitelisting cannot be used, and the OCL driver needs support for it. A
workaround there would be to use the command parser, which verifies
buffers before execution. But such parsing comes at a considerable
performance cost.
b) Providing the flag as context IOCTL setting.
The data port coherency switch could be implemented as a context parameter,
which would schedule submission of a buffer to switch the coherency flag.
That is an elegant solution with bounds the flag to context, which matches
the hardware placement of the feature. This solution was not accepted
because of OCL driver performance concerns. The OCL driver is constructed
with emphasis on creating small, but very frequent submissions. With such
architecture, adding IOCTL setparam call before submission has considerable
impact on the performance.
Bspec: 11419
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
drivers/gpu/drm/i915/i915_drv.c | 3 ++
drivers/gpu/drm/i915/i915_gem_context.h | 1 +
drivers/gpu/drm/i915/i915_gem_execbuffer.c | 17 ++++++++++
drivers/gpu/drm/i915/intel_lrc.c | 53 ++++++++++++++++++++++++++++++
drivers/gpu/drm/i915/intel_lrc.h | 3 ++
include/uapi/drm/i915_drm.h | 12 ++++++-
6 files changed, 88 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d354627..030854e 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -436,6 +436,9 @@ static int i915_getparam_ioctl(struct drm_device *dev, void *data,
case I915_PARAM_CS_TIMESTAMP_FREQUENCY:
value = 1000 * INTEL_INFO(dev_priv)->cs_timestamp_frequency_khz;
break;
+ case I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY:
+ value = (INTEL_GEN(dev_priv) >= 9);
+ break;
default:
DRM_DEBUG("Unknown parameter %d\n", param->param);
return -EINVAL;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 7854262..00aa309 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -118,6 +118,7 @@ struct i915_gem_context {
#define CONTEXT_BANNABLE 3
#define CONTEXT_BANNED 4
#define CONTEXT_FORCE_SINGLE_SUBMISSION 5
+#define CONTEXT_DATA_PORT_COHERENT 6
/**
* @hw_id: - unique identifier for the context
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 8c170db..e3a2f9e 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -2245,6 +2245,18 @@ i915_gem_do_execbuffer(struct drm_device *dev,
eb.batch_flags |= I915_DISPATCH_RS;
}
+ if (args->flags & I915_EXEC_DATA_PORT_COHERENT) {
+ if (INTEL_GEN(eb.i915) < 9) {
+ DRM_DEBUG("Data Port Coherency is only allowed for Gen9 and above\n");
+ return -EINVAL;
+ }
+ if (eb.engine->class != RENDER_CLASS) {
+ DRM_DEBUG("Data Port Coherency is not available on %s\n",
+ eb.engine->name);
+ return -EINVAL;
+ }
+ }
+
if (args->flags & I915_EXEC_FENCE_IN) {
in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
if (!in_fence)
@@ -2371,6 +2383,11 @@ i915_gem_do_execbuffer(struct drm_device *dev,
goto err_batch_unpin;
}
+ /* Emit the switch of data port coherency state if needed */
+ err = intel_lr_context_modify_data_port_coherency(eb.request,
+ (args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);
+ GEM_WARN_ON(err);
+
if (in_fence) {
err = i915_request_await_dma_fence(eb.request, in_fence);
if (err < 0)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index f60b61b..2094494 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -254,6 +254,59 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
ce->lrc_desc = desc;
}
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+ u32 *cs;
+ i915_reg_t reg;
+
+ GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+ GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+ cs = intel_ring_begin(req, 4);
+ if (IS_ERR(cs))
+ return PTR_ERR(cs);
+
+ if (INTEL_GEN(req->i915) >= 10)
+ reg = CNL_HDC_CHICKEN0;
+ else
+ reg = HDC_CHICKEN0;
+
+ *cs++ = MI_LOAD_REGISTER_IMM(1);
+ *cs++ = i915_mmio_reg_offset(reg);
+ /* Enabling coherency means disabling the bit which forces it off */
+ if (enable)
+ *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+ else
+ *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+ *cs++ = MI_NOOP;
+
+ intel_ring_advance(req, cs);
+
+ return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+ bool enable)
+{
+ struct i915_gem_context *ctx = req->ctx;
+ int ret;
+
+ if (test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags) == enable)
+ return 0;
+
+ ret = emit_set_data_port_coherency(req, enable);
+
+ if (!ret) {
+ if (enable)
+ __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+ else
+ __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+ }
+
+ return ret;
+}
+
static struct i915_priolist *
lookup_priolist(struct intel_engine_cs *engine,
struct i915_priotree *pt,
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 59d7b86..c46b239 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -111,4 +111,7 @@ intel_lr_context_descriptor(struct i915_gem_context *ctx,
return ctx->engine[engine->id].lrc_desc;
}
+int intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+ bool enable);
+
#endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..0f52793 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -529,6 +529,11 @@ typedef struct drm_i915_irq_wait {
*/
#define I915_PARAM_CS_TIMESTAMP_FREQUENCY 51
+/* Query whether DRM_I915_GEM_EXECBUFFER2 supports the ability to switch
+ * Data Cache access into Data Port Coherency mode.
+ */
+#define I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY 52
+
typedef struct drm_i915_getparam {
__s32 param;
/*
@@ -1048,7 +1053,12 @@ struct drm_i915_gem_execbuffer2 {
*/
#define I915_EXEC_FENCE_ARRAY (1<<19)
-#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_ARRAY<<1))
+/* Data Port Coherency capability will be switched before an exec call
+ * which has this flag different than previous call for the context.
+ */
+#define I915_EXEC_DATA_PORT_COHERENT (1<<20)
+
+#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_DATA_PORT_COHERENT<<1))
#define I915_EXEC_CONTEXT_ID_MASK (0xffffffff)
#define i915_execbuffer2_set_context_id(eb2, context) \
--
2.7.4
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [PATCH v1] drm/i915: Add Exec param to control data port coherency.
2018-03-30 17:29 ` [PATCH " Tomasz Lis
@ 2018-03-31 19:07 ` kbuild test robot
0 siblings, 0 replies; 88+ messages in thread
From: kbuild test robot @ 2018-03-31 19:07 UTC (permalink / raw)
To: Tomasz Lis; +Cc: intel-gfx, bartosz.dunajski, kbuild-all
[-- Attachment #1: Type: text/plain, Size: 11050 bytes --]
Hi Tomasz,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on v4.16-rc7 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Tomasz-Lis/drm-i915-Add-Exec-param-to-control-data-port-coherency/20180401-021313
base: git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-x010-201813 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All warnings (new ones prefixed by >>):
In file included from drivers/gpu//drm/i915/i915_request.h:30:0,
from drivers/gpu//drm/i915/i915_gem_timeline.h:30,
from drivers/gpu//drm/i915/intel_ringbuffer.h:8,
from drivers/gpu//drm/i915/intel_lrc.h:27,
from drivers/gpu//drm/i915/i915_drv.h:63,
from drivers/gpu//drm/i915/i915_gem_execbuffer.c:38:
drivers/gpu//drm/i915/i915_gem_execbuffer.c: In function 'i915_gem_do_execbuffer':
drivers/gpu//drm/i915/i915_gem.h:47:54: warning: statement with no effect [-Wunused-value]
#define GEM_WARN_ON(expr) (BUILD_BUG_ON_INVALID(expr), 0)
~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
>> drivers/gpu//drm/i915/i915_gem_execbuffer.c:2389:2: note: in expansion of macro 'GEM_WARN_ON'
GEM_WARN_ON(err);
^~~~~~~~~~~
vim +/GEM_WARN_ON +2389 drivers/gpu//drm/i915/i915_gem_execbuffer.c
2182
2183 static int
2184 i915_gem_do_execbuffer(struct drm_device *dev,
2185 struct drm_file *file,
2186 struct drm_i915_gem_execbuffer2 *args,
2187 struct drm_i915_gem_exec_object2 *exec,
2188 struct drm_syncobj **fences)
2189 {
2190 struct i915_execbuffer eb;
2191 struct dma_fence *in_fence = NULL;
2192 struct sync_file *out_fence = NULL;
2193 int out_fence_fd = -1;
2194 int err;
2195
2196 BUILD_BUG_ON(__EXEC_INTERNAL_FLAGS & ~__I915_EXEC_ILLEGAL_FLAGS);
2197 BUILD_BUG_ON(__EXEC_OBJECT_INTERNAL_FLAGS &
2198 ~__EXEC_OBJECT_UNKNOWN_FLAGS);
2199
2200 eb.i915 = to_i915(dev);
2201 eb.file = file;
2202 eb.args = args;
2203 if (DBG_FORCE_RELOC || !(args->flags & I915_EXEC_NO_RELOC))
2204 args->flags |= __EXEC_HAS_RELOC;
2205
2206 eb.exec = exec;
2207 eb.vma = (struct i915_vma **)(exec + args->buffer_count + 1);
2208 eb.vma[0] = NULL;
2209 eb.flags = (unsigned int *)(eb.vma + args->buffer_count + 1);
2210
2211 eb.invalid_flags = __EXEC_OBJECT_UNKNOWN_FLAGS;
2212 if (USES_FULL_PPGTT(eb.i915))
2213 eb.invalid_flags |= EXEC_OBJECT_NEEDS_GTT;
2214 reloc_cache_init(&eb.reloc_cache, eb.i915);
2215
2216 eb.buffer_count = args->buffer_count;
2217 eb.batch_start_offset = args->batch_start_offset;
2218 eb.batch_len = args->batch_len;
2219
2220 eb.batch_flags = 0;
2221 if (args->flags & I915_EXEC_SECURE) {
2222 if (!drm_is_current_master(file) || !capable(CAP_SYS_ADMIN))
2223 return -EPERM;
2224
2225 eb.batch_flags |= I915_DISPATCH_SECURE;
2226 }
2227 if (args->flags & I915_EXEC_IS_PINNED)
2228 eb.batch_flags |= I915_DISPATCH_PINNED;
2229
2230 eb.engine = eb_select_engine(eb.i915, file, args);
2231 if (!eb.engine)
2232 return -EINVAL;
2233
2234 if (args->flags & I915_EXEC_RESOURCE_STREAMER) {
2235 if (!HAS_RESOURCE_STREAMER(eb.i915)) {
2236 DRM_DEBUG("RS is only allowed for Haswell, Gen8 and above\n");
2237 return -EINVAL;
2238 }
2239 if (eb.engine->id != RCS) {
2240 DRM_DEBUG("RS is not available on %s\n",
2241 eb.engine->name);
2242 return -EINVAL;
2243 }
2244
2245 eb.batch_flags |= I915_DISPATCH_RS;
2246 }
2247
2248 if (args->flags & I915_EXEC_DATA_PORT_COHERENT) {
2249 if (INTEL_GEN(eb.i915) < 9) {
2250 DRM_DEBUG("Data Port Coherency is only allowed for Gen9 and above\n");
2251 return -EINVAL;
2252 }
2253 if (eb.engine->class != RENDER_CLASS) {
2254 DRM_DEBUG("Data Port Coherency is not available on %s\n",
2255 eb.engine->name);
2256 return -EINVAL;
2257 }
2258 }
2259
2260 if (args->flags & I915_EXEC_FENCE_IN) {
2261 in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
2262 if (!in_fence)
2263 return -EINVAL;
2264 }
2265
2266 if (args->flags & I915_EXEC_FENCE_OUT) {
2267 out_fence_fd = get_unused_fd_flags(O_CLOEXEC);
2268 if (out_fence_fd < 0) {
2269 err = out_fence_fd;
2270 goto err_in_fence;
2271 }
2272 }
2273
2274 err = eb_create(&eb);
2275 if (err)
2276 goto err_out_fence;
2277
2278 GEM_BUG_ON(!eb.lut_size);
2279
2280 err = eb_select_context(&eb);
2281 if (unlikely(err))
2282 goto err_destroy;
2283
2284 /*
2285 * Take a local wakeref for preparing to dispatch the execbuf as
2286 * we expect to access the hardware fairly frequently in the
2287 * process. Upon first dispatch, we acquire another prolonged
2288 * wakeref that we hold until the GPU has been idle for at least
2289 * 100ms.
2290 */
2291 intel_runtime_pm_get(eb.i915);
2292
2293 err = i915_mutex_lock_interruptible(dev);
2294 if (err)
2295 goto err_rpm;
2296
2297 err = eb_relocate(&eb);
2298 if (err) {
2299 /*
2300 * If the user expects the execobject.offset and
2301 * reloc.presumed_offset to be an exact match,
2302 * as for using NO_RELOC, then we cannot update
2303 * the execobject.offset until we have completed
2304 * relocation.
2305 */
2306 args->flags &= ~__EXEC_HAS_RELOC;
2307 goto err_vma;
2308 }
2309
2310 if (unlikely(*eb.batch->exec_flags & EXEC_OBJECT_WRITE)) {
2311 DRM_DEBUG("Attempting to use self-modifying batch buffer\n");
2312 err = -EINVAL;
2313 goto err_vma;
2314 }
2315 if (eb.batch_start_offset > eb.batch->size ||
2316 eb.batch_len > eb.batch->size - eb.batch_start_offset) {
2317 DRM_DEBUG("Attempting to use out-of-bounds batch\n");
2318 err = -EINVAL;
2319 goto err_vma;
2320 }
2321
2322 if (eb_use_cmdparser(&eb)) {
2323 struct i915_vma *vma;
2324
2325 vma = eb_parse(&eb, drm_is_current_master(file));
2326 if (IS_ERR(vma)) {
2327 err = PTR_ERR(vma);
2328 goto err_vma;
2329 }
2330
2331 if (vma) {
2332 /*
2333 * Batch parsed and accepted:
2334 *
2335 * Set the DISPATCH_SECURE bit to remove the NON_SECURE
2336 * bit from MI_BATCH_BUFFER_START commands issued in
2337 * the dispatch_execbuffer implementations. We
2338 * specifically don't want that set on batches the
2339 * command parser has accepted.
2340 */
2341 eb.batch_flags |= I915_DISPATCH_SECURE;
2342 eb.batch_start_offset = 0;
2343 eb.batch = vma;
2344 }
2345 }
2346
2347 if (eb.batch_len == 0)
2348 eb.batch_len = eb.batch->size - eb.batch_start_offset;
2349
2350 /*
2351 * snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
2352 * batch" bit. Hence we need to pin secure batches into the global gtt.
2353 * hsw should have this fixed, but bdw mucks it up again. */
2354 if (eb.batch_flags & I915_DISPATCH_SECURE) {
2355 struct i915_vma *vma;
2356
2357 /*
2358 * So on first glance it looks freaky that we pin the batch here
2359 * outside of the reservation loop. But:
2360 * - The batch is already pinned into the relevant ppgtt, so we
2361 * already have the backing storage fully allocated.
2362 * - No other BO uses the global gtt (well contexts, but meh),
2363 * so we don't really have issues with multiple objects not
2364 * fitting due to fragmentation.
2365 * So this is actually safe.
2366 */
2367 vma = i915_gem_object_ggtt_pin(eb.batch->obj, NULL, 0, 0, 0);
2368 if (IS_ERR(vma)) {
2369 err = PTR_ERR(vma);
2370 goto err_vma;
2371 }
2372
2373 eb.batch = vma;
2374 }
2375
2376 /* All GPU relocation batches must be submitted prior to the user rq */
2377 GEM_BUG_ON(eb.reloc_cache.rq);
2378
2379 /* Allocate a request for this batch buffer nice and early. */
2380 eb.request = i915_request_alloc(eb.engine, eb.ctx);
2381 if (IS_ERR(eb.request)) {
2382 err = PTR_ERR(eb.request);
2383 goto err_batch_unpin;
2384 }
2385
2386 /* Emit the switch of data port coherency state if needed */
2387 err = intel_lr_context_modify_data_port_coherency(eb.request,
2388 (args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);
> 2389 GEM_WARN_ON(err);
2390
2391 if (in_fence) {
2392 err = i915_request_await_dma_fence(eb.request, in_fence);
2393 if (err < 0)
2394 goto err_request;
2395 }
2396
2397 if (fences) {
2398 err = await_fence_array(&eb, fences);
2399 if (err)
2400 goto err_request;
2401 }
2402
2403 if (out_fence_fd != -1) {
2404 out_fence = sync_file_create(&eb.request->fence);
2405 if (!out_fence) {
2406 err = -ENOMEM;
2407 goto err_request;
2408 }
2409 }
2410
2411 /*
2412 * Whilst this request exists, batch_obj will be on the
2413 * active_list, and so will hold the active reference. Only when this
2414 * request is retired will the the batch_obj be moved onto the
2415 * inactive_list and lose its active reference. Hence we do not need
2416 * to explicitly hold another reference here.
2417 */
2418 eb.request->batch = eb.batch;
2419
2420 trace_i915_request_queue(eb.request, eb.batch_flags);
2421 err = eb_submit(&eb);
2422 err_request:
2423 __i915_request_add(eb.request, err == 0);
2424 add_to_client(eb.request, file);
2425
2426 if (fences)
2427 signal_fence_array(&eb, fences);
2428
2429 if (out_fence) {
2430 if (err == 0) {
2431 fd_install(out_fence_fd, out_fence->file);
2432 args->rsvd2 &= GENMASK_ULL(31, 0); /* keep in-fence */
2433 args->rsvd2 |= (u64)out_fence_fd << 32;
2434 out_fence_fd = -1;
2435 } else {
2436 fput(out_fence->file);
2437 }
2438 }
2439
2440 err_batch_unpin:
2441 if (eb.batch_flags & I915_DISPATCH_SECURE)
2442 i915_vma_unpin(eb.batch);
2443 err_vma:
2444 if (eb.exec)
2445 eb_release_vmas(&eb);
2446 mutex_unlock(&dev->struct_mutex);
2447 err_rpm:
2448 intel_runtime_pm_put(eb.i915);
2449 i915_gem_context_put(eb.ctx);
2450 err_destroy:
2451 eb_destroy(&eb);
2452 err_out_fence:
2453 if (out_fence_fd != -1)
2454 put_unused_fd(out_fence_fd);
2455 err_in_fence:
2456 dma_fence_put(in_fence);
2457 return err;
2458 }
2459
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31427 bytes --]
^ permalink raw reply [flat|nested] 88+ messages in thread
* [PATCH v2] drm/i915: Add Exec param to control data port coherency.
2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
2018-03-19 12:43 ` Chris Wilson
2018-03-30 17:29 ` [PATCH " Tomasz Lis
@ 2018-04-11 15:46 ` Tomasz Lis
2018-06-20 15:03 ` [PATCH v1] Second implementation of Data Port Coherency Tomasz Lis
` (4 subsequent siblings)
7 siblings, 0 replies; 88+ messages in thread
From: Tomasz Lis @ 2018-04-11 15:46 UTC (permalink / raw)
To: intel-gfx; +Cc: bartosz.dunajski
The patch adds a parameter to control the data port coherency functionality
on a per-exec call basis. When the data port coherency flag value differs
from the one in the previous call for the context, a command to switch the
data port coherency state is added before the buffer to be executed.
Rationale:
The OpenCL driver developers requested functionality to control cache
coherency at the data port level. Keeping coherency at that level is disabled
by default due to its performance costs. The OpenCL driver is planning to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:
1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?
Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of coherency enable/disable
switch is to remove overhead of memory coherency when memory coherency is not
needed.
2. Why do we need a global coherency switch?
In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                 ________________
                |   NODE1        |
                |  uint64_t data |
                +----------------|
                | NODE*  | NODE* |
                +--------+-------+
                   /          \
   ________________/            \________________
  |   NODE2        |            |   NODE3        |
  |  uint64_t data |            |  uint64_t data |
  +----------------|            +----------------|
  | NODE*  | NODE* |            | NODE*  | NODE* |
  +--------+-------+            +--------+-------+
Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as example with tree-like data structure), OCL
compiler is not able to determine origin of memory pointed to by an arbitrary
pointer - i.e. is not able to track given pointer back to a specific
allocation. As such, it's not able to decide whether coherency is needed or not
for specific pointer (or for specific I/O instruction). As a result, compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to workaround this (e.g. based on allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible, and an alternative method is needed.
Such an alternative is a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)
3. Will coherency switch be used frequently?
There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with the CPU, some fine-grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need the ability to
toggle the coherency switch on or off for each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
4. Why the execlist flag approach was chosen?
There are two other ways of providing the functionality to UMDs, besides the
execlist flag:
a) Chicken bit register whitelisting.
This approach would allow adding the functionality without any change to
the KMD interface. Also, it has been determined that whitelisting is safe for
gen10 and gen11. The issue is with gen9, where hardware whitelisting cannot
be used, and the OCL driver needs support there. A workaround would be to
use the command parser, which verifies buffers before execution. But such parsing
comes at a considerable performance cost.
b) Providing the flag as a context IOCTL setting.
The data port coherency switch could be implemented as a context parameter,
which would schedule submission of a buffer to switch the coherency flag.
That is an elegant solution which binds the flag to the context, matching
the hardware placement of the feature. This solution was not accepted
because of OCL driver performance concerns. The OCL driver is constructed
with emphasis on creating small, but very frequent submissions. With such
an architecture, adding an IOCTL setparam call before each submission has a
considerable impact on performance.
v2: Fixed compilation warning.
Bspec: 11419
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
drivers/gpu/drm/i915/i915_drv.c | 3 ++
drivers/gpu/drm/i915/i915_gem_context.h | 1 +
drivers/gpu/drm/i915/i915_gem_execbuffer.c | 18 ++++++++++
drivers/gpu/drm/i915/intel_lrc.c | 53 ++++++++++++++++++++++++++++++
drivers/gpu/drm/i915/intel_lrc.h | 3 ++
include/uapi/drm/i915_drm.h | 12 ++++++-
6 files changed, 89 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index f770be1..19493b0 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -436,6 +436,9 @@ static int i915_getparam_ioctl(struct drm_device *dev, void *data,
case I915_PARAM_CS_TIMESTAMP_FREQUENCY:
value = 1000 * INTEL_INFO(dev_priv)->cs_timestamp_frequency_khz;
break;
+ case I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY:
+ value = (INTEL_GEN(dev_priv) >= 9);
+ break;
default:
DRM_DEBUG("Unknown parameter %d\n", param->param);
return -EINVAL;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 7854262..00aa309 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -118,6 +118,7 @@ struct i915_gem_context {
#define CONTEXT_BANNABLE 3
#define CONTEXT_BANNED 4
#define CONTEXT_FORCE_SINGLE_SUBMISSION 5
+#define CONTEXT_DATA_PORT_COHERENT 6
/**
* @hw_id: - unique identifier for the context
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index c74f5df..ada376c 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -2274,6 +2274,18 @@ i915_gem_do_execbuffer(struct drm_device *dev,
eb.batch_flags |= I915_DISPATCH_RS;
}
+ if (args->flags & I915_EXEC_DATA_PORT_COHERENT) {
+ if (INTEL_GEN(eb.i915) < 9) {
+ DRM_DEBUG("Data Port Coherency is only allowed for Gen9 and above\n");
+ return -EINVAL;
+ }
+ if (eb.engine->class != RENDER_CLASS) {
+ DRM_DEBUG("Data Port Coherency is not available on %s\n",
+ eb.engine->name);
+ return -EINVAL;
+ }
+ }
+
if (args->flags & I915_EXEC_FENCE_IN) {
in_fence = sync_file_get_fence(lower_32_bits(args->rsvd2));
if (!in_fence)
@@ -2400,6 +2412,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
goto err_batch_unpin;
}
+ /* Emit the switch of data port coherency state if needed */
+ err = intel_lr_context_modify_data_port_coherency(eb.request,
+ (args->flags & I915_EXEC_DATA_PORT_COHERENT) != 0);
+ if (GEM_WARN_ON(err))
+ DRM_DEBUG("Data Port Coherency toggle failed, keeping old setting.\n");
+
if (in_fence) {
err = i915_request_await_dma_fence(eb.request, in_fence);
if (err < 0)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 02b25bf..b25df7e 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -255,6 +255,59 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
ce->lrc_desc = desc;
}
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+ u32 *cs;
+ i915_reg_t reg;
+
+ GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+ GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+ cs = intel_ring_begin(req, 4);
+ if (IS_ERR(cs))
+ return PTR_ERR(cs);
+
+ if (INTEL_GEN(req->i915) >= 10)
+ reg = CNL_HDC_CHICKEN0;
+ else
+ reg = HDC_CHICKEN0;
+
+ *cs++ = MI_LOAD_REGISTER_IMM(1);
+ *cs++ = i915_mmio_reg_offset(reg);
+ /* Enabling coherency means disabling the bit which forces it off */
+ if (enable)
+ *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+ else
+ *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+ *cs++ = MI_NOOP;
+
+ intel_ring_advance(req, cs);
+
+ return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+ bool enable)
+{
+ struct i915_gem_context *ctx = req->ctx;
+ int ret;
+
+ if (test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags) == enable)
+ return 0;
+
+ ret = emit_set_data_port_coherency(req, enable);
+
+ if (!ret) {
+ if (enable)
+ __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+ else
+ __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+ }
+
+ return ret;
+}
+
static struct i915_priolist *
lookup_priolist(struct intel_engine_cs *engine,
struct i915_priotree *pt,
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 59d7b86..c46b239 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -111,4 +111,7 @@ intel_lr_context_descriptor(struct i915_gem_context *ctx,
return ctx->engine[engine->id].lrc_desc;
}
+int intel_lr_context_modify_data_port_coherency(struct i915_request *req,
+ bool enable);
+
#endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..0f52793 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -529,6 +529,11 @@ typedef struct drm_i915_irq_wait {
*/
#define I915_PARAM_CS_TIMESTAMP_FREQUENCY 51
+/* Query whether DRM_I915_GEM_EXECBUFFER2 supports the ability to switch
+ * Data Cache access into Data Port Coherency mode.
+ */
+#define I915_PARAM_HAS_EXEC_DATA_PORT_COHERENCY 52
+
typedef struct drm_i915_getparam {
__s32 param;
/*
@@ -1048,7 +1053,12 @@ struct drm_i915_gem_execbuffer2 {
*/
#define I915_EXEC_FENCE_ARRAY (1<<19)
-#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_ARRAY<<1))
+/* Data Port Coherency capability will be switched before an exec call
+ * which has this flag different than previous call for the context.
+ */
+#define I915_EXEC_DATA_PORT_COHERENT (1<<20)
+
+#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_DATA_PORT_COHERENT<<1))
#define I915_EXEC_CONTEXT_ID_MASK (0xffffffff)
#define i915_execbuffer2_set_context_id(eb2, context) \
--
2.7.4
^ permalink raw reply related [flat|nested] 88+ messages in thread
* [PATCH v1] Second implementation of Data Port Coherency.
2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
` (2 preceding siblings ...)
2018-04-11 15:46 ` [PATCH v2] " Tomasz Lis
@ 2018-06-20 15:03 ` Tomasz Lis
2018-06-20 15:03 ` [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
2018-07-09 13:20 ` [PATCH v4] " Tomasz Lis
` (3 subsequent siblings)
7 siblings, 1 reply; 88+ messages in thread
From: Tomasz Lis @ 2018-06-20 15:03 UTC (permalink / raw)
To: intel-gfx; +Cc: bartosz.dunajski
The OCL Team agreed to use an IOCTL instead of an Exec flag to switch coherency
settings.
Also:
1. I will follow this patch with IGT test for the new functionality.
2. The OCL Team will publish UMD patch for it.
Tomasz Lis (1):
drm/i915: Add IOCTL Param to control data port coherency.
drivers/gpu/drm/i915/i915_gem_context.c | 41 ++++++++++++++++++++++++++
drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
drivers/gpu/drm/i915/intel_lrc.c | 51 +++++++++++++++++++++++++++++++++
drivers/gpu/drm/i915/intel_lrc.h | 4 +++
include/uapi/drm/i915_drm.h | 1 +
5 files changed, 103 insertions(+)
--
2.7.4
^ permalink raw reply [flat|nested] 88+ messages in thread
* [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
2018-06-20 15:03 ` [PATCH v1] Second implementation of Data Port Coherency Tomasz Lis
@ 2018-06-20 15:03 ` Tomasz Lis
2018-06-21 6:39 ` Joonas Lahtinen
` (2 more replies)
0 siblings, 3 replies; 88+ messages in thread
From: Tomasz Lis @ 2018-06-20 15:03 UTC (permalink / raw)
To: intel-gfx; +Cc: bartosz.dunajski
The patch adds a parameter to control the data port coherency functionality
at the context level. When the IOCTL is called, a command to switch the data
port coherency state is added to the ordered list. All prior requests are
executed with the old coherency settings, and all exec requests after the
IOCTL will use the new settings.
Rationale:
The OpenCL driver developers requested a functionality to control cache
coherency at the data port level. Keeping coherency at that level is disabled
by default due to its performance costs. The OpenCL driver plans to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:
1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?
Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of coherency enable/disable
switch is to remove overhead of memory coherency when memory coherency is not
needed.
2. Why do we need a global coherency switch?
In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                    ________________
                   | NODE1          |
                   | uint64_t data  |
                   +----------------|
                   | NODE*  | NODE* |
                   +--------+-------+
                     /            \
     ________________/              \________________
    | NODE2          |              | NODE3          |
    | uint64_t data  |              | uint64_t data  |
    +----------------|              +----------------|
    | NODE*  | NODE* |              | NODE*  | NODE* |
    +--------+-------+              +--------+-------+
Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as the example with a tree-like data structure),
the OCL compiler is not able to determine the origin of memory pointed to by an
arbitrary pointer - i.e. it is not able to track a given pointer back to a
specific allocation. As such, it is not able to decide whether coherency is
needed for a specific pointer (or for a specific I/O instruction). As a result,
the compiler encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to workaround this (e.g. based on allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and alternative method is needed.
Such an alternative solution is to have a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)
3. Will coherency switch be used frequently?
There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle coherency switch on or off per each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>
Bspec: 11419
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
drivers/gpu/drm/i915/i915_gem_context.c | 41 ++++++++++++++++++++++++++
drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
drivers/gpu/drm/i915/intel_lrc.c | 51 +++++++++++++++++++++++++++++++++
drivers/gpu/drm/i915/intel_lrc.h | 4 +++
include/uapi/drm/i915_drm.h | 1 +
5 files changed, 103 insertions(+)
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index ccf463a..ea65ae6 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -711,6 +711,24 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
}
+static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
+{
+ int ret;
+ ret = intel_lr_context_modify_data_port_coherency(ctx, true);
+ if (!GEM_WARN_ON(ret))
+ __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+ return ret;
+}
+
+static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
+{
+ int ret;
+ ret = intel_lr_context_modify_data_port_coherency(ctx, false);
+ if (!GEM_WARN_ON(ret))
+ __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+ return ret;
+}
+
int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
@@ -784,6 +802,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
+ struct drm_i915_private *dev_priv = to_i915(dev);
struct drm_i915_file_private *file_priv = file->driver_priv;
struct drm_i915_gem_context_param *args = data;
struct i915_gem_context *ctx;
@@ -818,6 +837,16 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
case I915_CONTEXT_PARAM_PRIORITY:
args->value = ctx->sched.priority;
break;
+ case I915_CONTEXT_PARAM_COHERENCY:
+ /*
+ * ENODEV if the feature is not supported. This removes the need
+ * of separate IS_SUPPORTED parameter.
+ */
+ if (INTEL_GEN(dev_priv) < 9)
+ ret = -ENODEV;
+ else
+ args->value = i915_gem_context_is_data_port_coherent(ctx);
+ break;
default:
ret = -EINVAL;
break;
@@ -830,6 +859,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
+ struct drm_i915_private *dev_priv = to_i915(dev);
struct drm_i915_file_private *file_priv = file->driver_priv;
struct drm_i915_gem_context_param *args = data;
struct i915_gem_context *ctx;
@@ -893,6 +923,17 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
}
break;
+ case I915_CONTEXT_PARAM_COHERENCY:
+ if (args->size)
+ ret = -EINVAL;
+ else if (INTEL_GEN(dev_priv) < 9)
+ ret = -ENODEV;
+ else if (args->value)
+ ret = i915_gem_context_set_data_port_coherent(ctx);
+ else
+ ret = i915_gem_context_clear_data_port_coherent(ctx);
+ break;
+
default:
ret = -EINVAL;
break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index b116e49..e8ccb70 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -126,6 +126,7 @@ struct i915_gem_context {
#define CONTEXT_BANNABLE 3
#define CONTEXT_BANNED 4
#define CONTEXT_FORCE_SINGLE_SUBMISSION 5
+#define CONTEXT_DATA_PORT_COHERENT 6
/**
* @hw_id: - unique identifier for the context
@@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
}
+static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
+{
+ return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+}
+
static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
{
return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 33bc914..c69dc26 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
ce->lrc_desc = desc;
}
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+ u32 *cs;
+ i915_reg_t reg;
+
+ GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+ GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+ cs = intel_ring_begin(req, 4);
+ if (IS_ERR(cs))
+ return PTR_ERR(cs);
+
+ if (INTEL_GEN(req->i915) >= 10)
+ reg = CNL_HDC_CHICKEN0;
+ else
+ reg = HDC_CHICKEN0;
+
+ /* FIXME: this feature may be unusable on CNL; if this turns out to be
+ * true, we should return -ENODEV for CNL. */
+ *cs++ = MI_LOAD_REGISTER_IMM(1);
+ *cs++ = i915_mmio_reg_offset(reg);
+ /* Enabling coherency means disabling the bit which forces it off */
+ if (enable)
+ *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+ else
+ *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+ *cs++ = MI_NOOP;
+
+ intel_ring_advance(req, cs);
+
+ return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+ bool enable)
+{
+ struct i915_request *req;
+ int ret;
+
+ req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+
+ ret = emit_set_data_port_coherency(req, enable);
+
+ i915_request_add(req);
+
+ return ret;
+}
+
static struct i915_priolist *
lookup_priolist(struct intel_engine_cs *engine, int prio)
{
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 1593194..214e291 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -104,4 +104,8 @@ struct i915_gem_context;
void intel_lr_context_resume(struct drm_i915_private *dev_priv);
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+ bool enable);
+
#endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..fab072f 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
#define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE 0x4
#define I915_CONTEXT_PARAM_BANNABLE 0x5
#define I915_CONTEXT_PARAM_PRIORITY 0x6
+#define I915_CONTEXT_PARAM_COHERENCY 0x7
#define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
#define I915_CONTEXT_DEFAULT_PRIORITY 0
#define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
--
2.7.4
* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
2018-06-20 15:03 ` [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
@ 2018-06-21 6:39 ` Joonas Lahtinen
2018-06-21 13:47 ` Lis, Tomasz
2018-06-21 7:05 ` Chris Wilson
2018-06-21 7:31 ` Dunajski, Bartosz
2 siblings, 1 reply; 88+ messages in thread
From: Joonas Lahtinen @ 2018-06-21 6:39 UTC (permalink / raw)
To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski
Changelog would be much appreciated. And this is not the first version
of the series. It helps to remind the reviewer that original
implementation was changed into an IOCTL based on feedback. Please see the
git log in i915 for some examples.
Quoting Tomasz Lis (2018-06-20 18:03:07)
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
>
> Rationale:
>
> The OpenCL driver developers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic question explaining background
> of the functionality and reasoning for the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
> ________________
> | NODE1 |
> | uint64_t data |
> +----------------|
> | NODE* | NODE*|
> +--------+-------+
> / \
> ________________/ \________________
> | NODE2 | | NODE3 |
> | uint64_t data | | uint64_t data |
> +----------------| +----------------|
> | NODE* | NODE*| | NODE* | NODE*|
> +--------+-------+ +--------+-------+
>
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
>
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
>
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
>
> 3. Will coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
>
> Bspec: 11419
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
<SNIP>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index ccf463a..ea65ae6 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -711,6 +711,24 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
> return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
> }
>
> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + int ret;
> + ret = intel_lr_context_modify_data_port_coherency(ctx, true);
> + if (!GEM_WARN_ON(ret))
I don't think there's need for the WARN as the error will be propagated
back to userspace?
> + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> + return ret;
> +}
> +
> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + int ret;
> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> + if (!GEM_WARN_ON(ret))
Ditto.
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> + return ret;
> +}
> +
> int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> @@ -784,6 +802,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *dev_priv = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -818,6 +837,16 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_PRIORITY:
> args->value = ctx->sched.priority;
> break;
> + case I915_CONTEXT_PARAM_COHERENCY:
> + /*
> + * ENODEV if the feature is not supported. This removes the need
> + * of separate IS_SUPPORTED parameter.
> + */
Code speaks for itself, the comment is not needed.
> + if (INTEL_GEN(dev_priv) < 9)
> + ret = -ENODEV;
> + else
> + args->value = i915_gem_context_is_data_port_coherent(ctx);
> + break;
> default:
> ret = -EINVAL;
> break;
> @@ -893,6 +923,17 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> }
> break;
>
> + case I915_CONTEXT_PARAM_COHERENCY:
> + if (args->size)
> + ret = -EINVAL;
> + else if (INTEL_GEN(dev_priv) < 9)
> + ret = -ENODEV;
> + else if (args->value)
> + ret = i915_gem_context_set_data_port_coherent(ctx);
> + else
> + ret = i915_gem_context_clear_data_port_coherent(ctx);
Be more strict with the uAPI. Only accept values 0 or 1, then you leave
space for extension in the future.
> + break;
> +
> default:
> ret = -EINVAL;
> break;
<SNIP>
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
I'm feeling this is not the right file. The bit is in hardware context,
and doesn't have so much to do with LRC.
> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
> ce->lrc_desc = desc;
> }
>
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
> +{
> + u32 *cs;
> + i915_reg_t reg;
> +
> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> + cs = intel_ring_begin(req, 4);
> + if (IS_ERR(cs))
> + return PTR_ERR(cs);
> +
> + if (INTEL_GEN(req->i915) >= 10)
> + reg = CNL_HDC_CHICKEN0;
> + else
> + reg = HDC_CHICKEN0;
> +
> + /* FIXME: this feature may be unuseable on CNL; If this checks to be
> + * true, we should enodev for CNL. */
This is exactly why we want the IGT tests to check for effects, not for
the register. Then we can get an answer by running the tests on all kind
of CNL systems at hand.
> + *cs++ = MI_LOAD_REGISTER_IMM(1);
> + *cs++ = i915_mmio_reg_offset(reg);
> + /* Enabling coherency means disabling the bit which forces it off */
Code is again very self explanatory without the comment.
> + if (enable)
> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> + else
> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> + *cs++ = MI_NOOP;
> +
> + intel_ring_advance(req, cs);
> +
> + return 0;
> +}
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> + bool enable)
> +{
> + struct i915_request *req;
> + int ret;
> +
> + req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
> + if (IS_ERR(req))
> + return PTR_ERR(req);
> +
> + ret = emit_set_data_port_coherency(req, enable);
> +
> + i915_request_add(req);
> +
> + return ret;
> +}
I'm thinking we should set this value when it has changed, when we insert the
requests into the command stream. So if you change back and forth, while
not emitting any requests, nothing really happens. If you change the value and
emit a request, we should emit a LRI before the jump to the commands.
Similarly, if you keep setting the value to the value it already was in,
nothing will happen, again.
> +
> static struct i915_priolist *
> lookup_priolist(struct intel_engine_cs *engine, int prio)
> {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..214e291 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>
> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> + bool enable);
> +
> #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..fab072f 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
> #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE 0x4
> #define I915_CONTEXT_PARAM_BANNABLE 0x5
> #define I915_CONTEXT_PARAM_PRIORITY 0x6
> +#define I915_CONTEXT_PARAM_COHERENCY 0x7
Please add this line after the indented context priorities.
> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
> #define I915_CONTEXT_DEFAULT_PRIORITY 0
> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
Here.
Regards, Joonas
* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
2018-06-21 6:39 ` Joonas Lahtinen
@ 2018-06-21 13:47 ` Lis, Tomasz
2018-07-18 13:03 ` Joonas Lahtinen
0 siblings, 1 reply; 88+ messages in thread
From: Lis, Tomasz @ 2018-06-21 13:47 UTC (permalink / raw)
To: Joonas Lahtinen, intel-gfx; +Cc: bartosz.dunajski
On 2018-06-21 08:39, Joonas Lahtinen wrote:
> Changelog would be much appreciated. And this is not the first version
> of the series. It helps to remind the reviewer that original
> implementation was changed into IOCTl based on feedback. Please see the
> git log in i915 for some examples.
Will add. I considered this a separate series, as it is a different
implementation.
>
> Quoting Tomasz Lis (2018-06-20 18:03:07)
>> The patch adds a parameter to control the data port coherency functionality
>> on a per-context level. When the IOCTL is called, a command to switch data
>> port coherency state is added to the ordered list. All prior requests are
>> executed on old coherency settings, and all exec requests after the IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver developers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic question explaining background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
>> overhead related to tracking (snooping) memory inside different cache units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
>> memory coherency between CPU and GPU). The goal of coherency enable/disable
>> switch is to remove overhead of memory coherency when memory coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units), Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
>> model is similar to regular memory load/store operations available on typical
>> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>> ________________
>> | NODE1 |
>> | uint64_t data |
>> +----------------|
>> | NODE* | NODE*|
>> +--------+-------+
>> / \
>> ________________/ \________________
>> | NODE2 | | NODE3 |
>> | uint64_t data | | uint64_t data |
>> +----------------| +----------------|
>> | NODE* | NODE*| | NODE* | NODE*|
>> +--------+-------+ +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory locations
>> in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
>> Virtual Memory feature). Using pointers from different allocations doesn't
>> affect the stateless addressing model which even allows scattered reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in stateless model
>> can be encoded (at ISA level) to either use or disable coherency. However, for
>> generic OCL applications (such as example with tree-like data structure), OCL
>> compiler is not able to determine origin of memory pointed to by an arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is needed or not
>> for specific pointer (or for specific I/O instruction). As a result, compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that it would be
>> possible to workaround this (e.g. based on allocations map and pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
>> is not feasible and alternative method is needed.
>>
>> Such alternative solution is to have a global coherency switch that allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
>> the payload that kern_master produced. These two kernels work in a loop, one
>> after another. Since only kern_master requires coherency, kern_worker should
>> not be forced to pay for it. This means that we need to have the ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>
>> Bspec: 11419
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> <SNIP>
>
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>> index ccf463a..ea65ae6 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -711,6 +711,24 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
>> return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
>> }
>>
>> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
>> +{
>> + int ret;
>> + ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>> + if (!GEM_WARN_ON(ret))
> I don't think there's need for the WARN as the error will be propagated
> back to userspace?
You're right.
>
>> + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> + return ret;
>> +}
>> +
>> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
>> +{
>> + int ret;
>> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>> + if (!GEM_WARN_ON(ret))
> Ditto.
ack
>
>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> + return ret;
>> +}
>> +
>> int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>> struct drm_file *file)
>> {
>> @@ -784,6 +802,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>> struct drm_file *file)
>> {
>> + struct drm_i915_private *dev_priv = to_i915(dev);
>> struct drm_i915_file_private *file_priv = file->driver_priv;
>> struct drm_i915_gem_context_param *args = data;
>> struct i915_gem_context *ctx;
>> @@ -818,6 +837,16 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>> case I915_CONTEXT_PARAM_PRIORITY:
>> args->value = ctx->sched.priority;
>> break;
>> + case I915_CONTEXT_PARAM_COHERENCY:
>> + /*
>> + * ENODEV if the feature is not supported. This removes the need
>> + * of separate IS_SUPPORTED parameter.
>> + */
> Code speaks for itself, the comment is not needed.
I don't think it is a good idea to limit comments. The current state of
the code makes it hard for anyone new to work on it, as the only
documentation is the history in the mailing list.
I don't think that's the correct approach. I believe comments should be
encouraged.
In this specific case, the code lets you know that ENODEV is returned
below gen9. But there is no IS_DATA_PORT_COHERENCY_SUPPORTED() macro
which would clearly indicate the cause, so a comment is required.
>> + if (INTEL_GEN(dev_priv) < 9)
>> + ret = -ENODEV;
>> + else
>> + args->value = i915_gem_context_is_data_port_coherent(ctx);
>> + break;
>> default:
>> ret = -EINVAL;
>> break;
>> @@ -893,6 +923,17 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>> }
>> break;
>>
>> + case I915_CONTEXT_PARAM_COHERENCY:
>> + if (args->size)
>> + ret = -EINVAL;
>> + else if (INTEL_GEN(dev_priv) < 9)
>> + ret = -ENODEV;
>> + else if (args->value)
>> + ret = i915_gem_context_set_data_port_coherent(ctx);
>> + else
>> + ret = i915_gem_context_clear_data_port_coherent(ctx);
> Be more strict with the uAPI. Only accept values 0 or 1, then you leave
> space for extension in the future.
Right. Will do.
>
>> + break;
>> +
>> default:
>> ret = -EINVAL;
>> break;
> <SNIP>
>
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> I'm feeling this is not the right file. The bit is in hardware context,
> and doesn't have so much to do with LRC.
Should I move it to i915_gem_context.c?
>
>> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>> ce->lrc_desc = desc;
>> }
>>
>> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
>> +{
>> + u32 *cs;
>> + i915_reg_t reg;
>> +
>> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> + cs = intel_ring_begin(req, 4);
>> + if (IS_ERR(cs))
>> + return PTR_ERR(cs);
>> +
>> + if (INTEL_GEN(req->i915) >= 10)
>> + reg = CNL_HDC_CHICKEN0;
>> + else
>> + reg = HDC_CHICKEN0;
>> +
>> + /* FIXME: this feature may be unuseable on CNL; If this checks to be
>> + * true, we should enodev for CNL. */
> This is exactly why we want the IGT tests to check for effects, not for
> the register. Then we can get an answer by running the tests on all kind
> of CNL systems at hand.
This comment is actually outdated, I left it by mistake. Will remove.
>
>> + *cs++ = MI_LOAD_REGISTER_IMM(1);
>> + *cs++ = i915_mmio_reg_offset(reg);
>> + /* Enabling coherency means disabling the bit which forces it off */
> Code is again very self explanatory without the comment.
The logic is reversed, so that "enable" does a "disable". I believe the
comment does a great job of assuring the reader that this is not just a
coding mistake.
Do we have any official guidelines for limiting comments?
>
>> + if (enable)
>> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> + else
>> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> + *cs++ = MI_NOOP;
>> +
>> + intel_ring_advance(req, cs);
>> +
>> + return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
>> + bool enable)
>> +{
>> + struct i915_request *req;
>> + int ret;
>> +
>> + req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>> + if (IS_ERR(req))
>> + return PTR_ERR(req);
>> +
>> + ret = emit_set_data_port_coherency(req, enable);
>> +
>> + i915_request_add(req);
>> +
>> + return ret;
>> +}
> I'm thinking we should set this value when it has changed, when we insert the
> requests into the command stream. So if you change back and forth, while
> not emitting any requests, nothing really happens. If you change the value and
> emit a request, we should emit a LRI before the jump to the commands.
> Similary if you keep setting the value to the value it already was in,
> nothing will happen, again.
When I considered that, my way of reasoning was:
If we execute the flag-changing buffer right away, it may reach the
hardware faster if there is no job in progress.
If we use the lazy way and trigger the change just before submission,
there will be additional conditions in the submission code, plus the
change will be made while there is another job pending (though it's not a
considerable payload to just switch a flag).
If user space switches the flag back and forth without much sense, then
there is something wrong with the user space driver, and it shouldn't be
up to the kernel to fix that.
This is why I chose the current approach. But I can change it if you wish.
>> +
>> static struct i915_priolist *
>> lookup_priolist(struct intel_engine_cs *engine, int prio)
>> {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..214e291 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>
>> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
>> + bool enable);
>> +
>> #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..fab072f 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
>> #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE 0x4
>> #define I915_CONTEXT_PARAM_BANNABLE 0x5
>> #define I915_CONTEXT_PARAM_PRIORITY 0x6
>> +#define I915_CONTEXT_PARAM_COHERENCY 0x7
> Please add this line after the indented context priorities.
ack
>
>> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
>> #define I915_CONTEXT_DEFAULT_PRIORITY 0
>> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
> Here.
>
> Regards, Joonas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
2018-06-21 13:47 ` Lis, Tomasz
@ 2018-07-18 13:03 ` Joonas Lahtinen
0 siblings, 0 replies; 88+ messages in thread
From: Joonas Lahtinen @ 2018-07-18 13:03 UTC (permalink / raw)
To: Lis, Tomasz, intel-gfx; +Cc: bartosz.dunajski
Quoting Lis, Tomasz (2018-06-21 16:47:45)
> On 2018-06-21 08:39, Joonas Lahtinen wrote:
> > Quoting Tomasz Lis (2018-06-20 18:03:07)
> >> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> >> struct drm_file *file)
> >> {
> >> + struct drm_i915_private *dev_priv = to_i915(dev);
> >> struct drm_i915_file_private *file_priv = file->driver_priv;
> >> struct drm_i915_gem_context_param *args = data;
> >> struct i915_gem_context *ctx;
> >> @@ -818,6 +837,16 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> >> case I915_CONTEXT_PARAM_PRIORITY:
> >> args->value = ctx->sched.priority;
> >> break;
> >> + case I915_CONTEXT_PARAM_COHERENCY:
> >> + /*
> >> + * ENODEV if the feature is not supported. This removes the need
> >> + * of separate IS_SUPPORTED parameter.
> >> + */
> > Code speaks for itself, the comment is not needed.
> I don't think it is a good idea to limit comments. The current state of
> the code makes it hard for anyone new to work on it, as the only
> documentation is the history in the mailing list.
> I don't think that's the correct approach. I believe comments should be
> encouraged.
It's not a matter of opinion. Code should be written clearly enough that
comments are not needed.
<SNIP>
> >> + *cs++ = MI_LOAD_REGISTER_IMM(1);
> >> + *cs++ = i915_mmio_reg_offset(reg);
> >> + /* Enabling coherency means disabling the bit which forces it off */
> > Code is again very self explanatory without the comment.
> The logic is reversed, so that "enable" does a "disable". I believe the
> comment does a great job of assuring the reader that this is not just a
> coding mistake.
>
> Do we have any official guidelines for limiting comments?
Yes, avoid where humanly possible. And when you can't avoid, it should
explain "why" not what. I don't see such cases in this patch.
<SNIP>
> > I'm thinking we should set this value when it has changed, when we insert the
> > requests into the command stream. So if you change back and forth, while
> > not emitting any requests, nothing really happens. If you change the value and
> > emit a request, we should emit a LRI before the jump to the commands.
> > Similary if you keep setting the value to the value it already was in,
> > nothing will happen, again.
> When I considered that, my way of reasoning was:
> If we execute the flag-changing buffer right away, it may reach the
> hardware faster if there is no job in progress.
> If we use the lazy way and trigger the change just before submission,
> there will be additional conditions in the submission code, plus the
> change will be made while there is another job pending (though it's not a
> considerable payload to just switch a flag).
> If user space switches the flag back and forth without much sense, then
> there is something wrong with the user space driver, and it shouldn't be
> up to the kernel to fix that.
A few register writes appended before jumping to the BB should not be a
performance concern. There will definitely be more overhead in sending a
whole separate request, so I'm not sure I follow the whole picture.
So I still think the right thing to do is to only emit the commands as a
prelude when needed.
Regards, Joonas
* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
2018-06-20 15:03 ` [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
2018-06-21 6:39 ` Joonas Lahtinen
@ 2018-06-21 7:05 ` Chris Wilson
2018-06-21 13:47 ` Lis, Tomasz
2018-06-21 7:31 ` Dunajski, Bartosz
2 siblings, 1 reply; 88+ messages in thread
From: Chris Wilson @ 2018-06-21 7:05 UTC (permalink / raw)
To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski
Quoting Tomasz Lis (2018-06-20 16:03:07)
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 33bc914..c69dc26 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
> ce->lrc_desc = desc;
> }
>
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
> +{
> + u32 *cs;
> + i915_reg_t reg;
> +
> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> + cs = intel_ring_begin(req, 4);
> + if (IS_ERR(cs))
> + return PTR_ERR(cs);
> +
> + if (INTEL_GEN(req->i915) >= 10)
> + reg = CNL_HDC_CHICKEN0;
> + else
> + reg = HDC_CHICKEN0;
> +
> + /* FIXME: this feature may be unuseable on CNL; If this checks to be
> + * true, we should enodev for CNL. */
> + *cs++ = MI_LOAD_REGISTER_IMM(1);
> + *cs++ = i915_mmio_reg_offset(reg);
> + /* Enabling coherency means disabling the bit which forces it off */
> + if (enable)
> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> + else
> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> + *cs++ = MI_NOOP;
> +
> + intel_ring_advance(req, cs);
> +
> + return 0;
> +}
There's nothing specific to the logical ringbuffer context here afaics.
It could have just been done inside the single
i915_gem_context_set_data_port_coherency(). Also makes it clearer that
i915_gem_context_set_data_port_coherency needs struct_mutex.
cmd = HDC_FORCE_NON_COHERENT << 16;
if (!coherent)
	cmd |= HDC_FORCE_NON_COHERENT;
*cs++ = cmd;
Does that read any clearer?
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..214e291 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>
> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> + bool enable);
> +
> #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..fab072f 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
> #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE 0x4
> #define I915_CONTEXT_PARAM_BANNABLE 0x5
> #define I915_CONTEXT_PARAM_PRIORITY 0x6
> +#define I915_CONTEXT_PARAM_COHERENCY 0x7
DATAPORT_COHERENCY
There are many different caches.
There should be some commentary around here telling userspace what the
contract is.
> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
> #define I915_CONTEXT_DEFAULT_PRIORITY 0
> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
COHERENCY has MAX/MIN_USER_PRIORITY, interesting. I thought it was just
a boolean.
-Chris
* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
2018-06-21 7:05 ` Chris Wilson
@ 2018-06-21 13:47 ` Lis, Tomasz
0 siblings, 0 replies; 88+ messages in thread
From: Lis, Tomasz @ 2018-06-21 13:47 UTC (permalink / raw)
To: Chris Wilson, intel-gfx; +Cc: bartosz.dunajski
On 2018-06-21 09:05, Chris Wilson wrote:
> Quoting Tomasz Lis (2018-06-20 16:03:07)
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 33bc914..c69dc26 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -258,6 +258,57 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>> ce->lrc_desc = desc;
>> }
>>
>> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
>> +{
>> + u32 *cs;
>> + i915_reg_t reg;
>> +
>> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> + cs = intel_ring_begin(req, 4);
>> + if (IS_ERR(cs))
>> + return PTR_ERR(cs);
>> +
>> + if (INTEL_GEN(req->i915) >= 10)
>> + reg = CNL_HDC_CHICKEN0;
>> + else
>> + reg = HDC_CHICKEN0;
>> +
>> + /* FIXME: this feature may be unuseable on CNL; If this checks to be
>> + * true, we should enodev for CNL. */
>> + *cs++ = MI_LOAD_REGISTER_IMM(1);
>> + *cs++ = i915_mmio_reg_offset(reg);
>> + /* Enabling coherency means disabling the bit which forces it off */
>> + if (enable)
>> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> + else
>> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> + *cs++ = MI_NOOP;
>> +
>> + intel_ring_advance(req, cs);
>> +
>> + return 0;
>> +}
> There's nothing specific to the logical ringbuffer context here afaics.
> It could have just been done inside the single
> i915_gem_context_set_data_port_coherency(). Also makes it clearer that
> i915_gem_context_set_data_port_coherency needs struct_mutex.
>
> cmd = HDC_FORCE_NON_COHERENT << 16;
> if (!coherent)
> 	cmd |= HDC_FORCE_NON_COHERENT;
> *cs++ = cmd;
>
> Does that read any clearer?
Sorry, I don't think I follow.
Should I move the code out of logical ringbuffer context (intel_lrc.c)?
Should I merge the emit_set_data_port_coherency() with
intel_lr_context_modify_data_port_coherency()?
Should I lock a mutex while adding the request?
-Tomasz
>
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..214e291 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>
>> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
>> + bool enable);
>> +
>> #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..fab072f 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1453,6 +1453,7 @@ struct drm_i915_gem_context_param {
>> #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE 0x4
>> #define I915_CONTEXT_PARAM_BANNABLE 0x5
>> #define I915_CONTEXT_PARAM_PRIORITY 0x6
>> +#define I915_CONTEXT_PARAM_COHERENCY 0x7
> DATAPORT_COHERENCY
> There are many different caches.
>
> There should be some commentary around here telling userspace what the
> contract is.
Will do.
>
>> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
>> #define I915_CONTEXT_DEFAULT_PRIORITY 0
>> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
> COHERENCY has MAX/MIN_USER_PRIORITY, interesting. I thought it was just
> a boolean.
> -Chris
I did not notice the structure of the defines here; I will move the new define.
-Tomasz
* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
2018-06-20 15:03 ` [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency Tomasz Lis
2018-06-21 6:39 ` Joonas Lahtinen
2018-06-21 7:05 ` Chris Wilson
@ 2018-06-21 7:31 ` Dunajski, Bartosz
2018-06-21 8:48 ` Joonas Lahtinen
2 siblings, 1 reply; 88+ messages in thread
From: Dunajski, Bartosz @ 2018-06-21 7:31 UTC (permalink / raw)
To: Lis, Tomasz, intel-gfx
I would like to add a few things that were mentioned previously.
Regarding the adoption plan:
Our plan is to drop the dependency on LLVM 4.0.1 (with custom patches) and instead compile with unpatched (either system or vanilla) LLVM 6.0. Work to transition our compiler stack to LLVM 6 is expected to complete in late Q3. Additionally, we are refactoring our packaging, so instead of a single neo package with multiple components we will have multiple versioned packages with clear dependencies.
We are coordinating with the ClearLinux team to get included once that happens, and plan to reach out to other OSVs to do the same.
Is this plan enough to consider NEO an actual open-source client for the coherency control patch?
PR for patch usage:
https://github.com/intel/compute-runtime/pull/53
Bartosz
* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
2018-06-21 7:31 ` Dunajski, Bartosz
@ 2018-06-21 8:48 ` Joonas Lahtinen
2018-06-22 16:40 ` Dunajski, Bartosz
0 siblings, 1 reply; 88+ messages in thread
From: Joonas Lahtinen @ 2018-06-21 8:48 UTC (permalink / raw)
To: Dunajski, Bartosz, Lis, Tomasz, intel-gfx, Dave Airlie
+ Dave Airlie (The DRM subsystem maintainer) for FYI
Quoting Dunajski, Bartosz (2018-06-21 10:31:57)
> I would like to add few things that were mentioned previously.
>
> According to adoption plan.
> Our plan is to drop dependency on LLVM 4.0.1 (with custom patches) and instead compile with unpatched (either system or vanilla) LLVM 6.0. Work to transition our compiler stack to LLVM 6 is expected to complete in late Q3. Additionally, we are refactoring our packaging, so instead of a single neo package with multiple components we will have multiple versioned packages with clear dependencies.
>
> We are coordinating with ClearLinux team to get included once that happens and plan to reach out to other OSVs to do the same.
>
> Is this plan enough to consider NEO an actual opensource client for the coherency control patch ?
Yes, once you follow through with the plan, there should be no issues
about merging patches to support the driver.
You may want to squeeze your timeline to be complete before 4.19-rc5,
which is the feature cutoff date for 4.20, but that is rather an
ambitious goal. Your original schedule would land the patches before
4.20-rc5 resulting in inclusion to 4.21.
Regards, Joonas
PS. I'm going on a vacation for a couple of weeks.
>
> PR for patch usage:
> https://github.com/intel/compute-runtime/pull/53
>
> Bartosz
* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
2018-06-21 8:48 ` Joonas Lahtinen
@ 2018-06-22 16:40 ` Dunajski, Bartosz
2018-07-18 13:12 ` Joonas Lahtinen
0 siblings, 1 reply; 88+ messages in thread
From: Dunajski, Bartosz @ 2018-06-22 16:40 UTC (permalink / raw)
To: Joonas Lahtinen, Lis, Tomasz, intel-gfx, Dave Airlie
Additionally, we are already on Arch:
https://aur.archlinux.org/packages/compute-runtime
Can I assume that the adoption plan is not a blocker anymore?
Bartosz
> Yes, once you follow through with the plan, there should be no issues about merging patches to support the driver.
>
> You may want to squeeze your timeline to be complete before 4.19-rc5, which is the feature cutoff date for 4.20, but that is rather an ambitious goal. Your original schedule would land the patches before
> 4.20-rc5 resulting in inclusion to 4.21.
>
> Regards, Joonas
>
> PS. I'm going on a vacation for a couple of weeks.
* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
2018-06-22 16:40 ` Dunajski, Bartosz
@ 2018-07-18 13:12 ` Joonas Lahtinen
2018-07-18 13:27 ` Dunajski, Bartosz
0 siblings, 1 reply; 88+ messages in thread
From: Joonas Lahtinen @ 2018-07-18 13:12 UTC (permalink / raw)
To: Dunajski, Bartosz, Lis, Tomasz, intel-gfx, Dave Airlie
Quoting Dunajski, Bartosz (2018-06-22 19:40:58)
> Additionally, we are already on Arch:
> https://aur.archlinux.org/packages/compute-runtime
I'm not an Arch user myself, but my impression is that AUR [1] is the equivalent
of Ubuntu's PPAs, where anybody can upload pretty much anything outside of
the support model of the distro.
> Can I assume that adoption plan is not a blocker anymore?
Due to the above, I don't think the matter has changed in one direction or
another.
Regards, Joonas
[1] https://wiki.archlinux.org/index.php/Arch_User_Repository#What_is_the_AUR.3F
>
> Bartosz
>
> > Yes, once you follow through with the plan, there should be no issues about merging patches to support the driver.
> >
> > You may want to squeeze your timeline to be complete before 4.19-rc5, which is the feature cutoff date for 4.20, but that is rather an ambitious goal. Your original schedule would land the patches before
> > 4.20-rc5 resulting in inclusion to 4.21.
> >
> > Regards, Joonas
> >
> > PS. I'm going on a vacation for a couple of weeks.
>
* Re: [PATCH v1] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-18 13:12 ` Joonas Lahtinen
@ 2018-07-18 13:27 ` Dunajski, Bartosz
0 siblings, 0 replies; 88+ messages in thread
From: Dunajski, Bartosz @ 2018-07-18 13:27 UTC (permalink / raw)
To: Joonas Lahtinen, Lis, Tomasz, intel-gfx, Dave Airlie
You are right about the AUR. This is just a step in the open-source community direction.
As per my previous answer about ClearLinux (and others), which is more important here: we are still coordinating this, but I think we are on the right path, and NEO can be considered an open-source client for the coherency patch.
> I'm not an Arch user myself, but my impression is that AUR [1] is equivalent of Ubuntu's PPA where anybody can very much upload anything outside of the support model of the distro.
* [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
` (3 preceding siblings ...)
2018-06-20 15:03 ` [PATCH v1] Second implementation of Data Port Coherency Tomasz Lis
@ 2018-07-09 13:20 ` Tomasz Lis
2018-07-09 13:48 ` Lionel Landwerlin
2018-07-09 16:28 ` Tvrtko Ursulin
2018-07-12 15:10 ` [PATCH v5] " Tomasz Lis
` (2 subsequent siblings)
7 siblings, 2 replies; 88+ messages in thread
From: Tomasz Lis @ 2018-07-09 13:20 UTC (permalink / raw)
To: intel-gfx; +Cc: bartosz.dunajski
The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch data
port coherency state is added to the ordered list. All prior requests are
executed on old coherency settings, and all exec requests after the IOCTL
will use new settings.
Rationale:
The OpenCL driver developers requested functionality to control cache
coherency at the data port level. Keeping coherency at that level is disabled
by default due to its performance costs. The OpenCL driver is planning to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:
1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?
Memory coherency between CPU and GEN, while being a great feature that enables
the CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of the coherency enable/disable
switch is to remove the overhead of memory coherency when memory coherency is
not needed.
2. Why do we need a global coherency switch?
In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                     ________________
                    |      NODE1     |
                    |  uint64_t data |
                    +----------------|
                    | NODE*  | NODE* |
                    +--------+-------+
                      /            \
      ________________/              \________________
     |      NODE2     |              |      NODE3     |
     |  uint64_t data |              |  uint64_t data |
     +----------------|              +----------------|
     | NODE*  | NODE* |              | NODE*  | NODE* |
     +--------+-------+              +--------+-------+
Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
When it comes to coherency programming, send instructions in the stateless model
can be encoded (at the ISA level) to either use or disable coherency. However,
for generic OCL applications (such as the example with a tree-like data
structure), the OCL compiler is not able to determine the origin of memory
pointed to by an arbitrary pointer - i.e. it is not able to track a given
pointer back to a specific allocation. As such, it is not able to decide whether
coherency is needed for a specific pointer (or for a specific I/O instruction).
As a result, the compiler encodes all stateless sends as coherent (doing
otherwise would lead to functional issues resulting from data corruption).
Please note that it would be possible to work around this (e.g. based on an
allocations map and pointer bounds checking prior to each I/O instruction), but
the performance cost of such a workaround would be many times greater than the
cost of keeping coherency always enabled. As such, enabling/disabling memory
coherency at the GEN ISA level is not feasible, and an alternative is needed.
The alternative solution is to have a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)
3. Will coherency switch be used frequently?
There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need the ability to
toggle the coherency switch on or off for each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY) kern_worker -> ...
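The toggling loop above maps onto the proposed uapi as a setparam call before each submission. The sketch below is hedged: the struct mirrors `drm_i915_gem_context_param` locally and the `0x7` value comes from this patch; real userspace would include `<drm/i915_drm.h>` and pass the filled struct to `DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM`, which is elided here.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the uapi struct, for illustration only. */
struct i915_context_param {
	uint32_t ctx_id;
	uint32_t size;
	uint64_t param;
	uint64_t value;
};

/* Value proposed in this patch's uapi hunk. */
#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7

/* Payload toggling data port coherency for a context: size must be 0
 * and value strictly 0 or 1, per the setparam checks in the patch. */
struct i915_context_param coherency_param(uint32_t ctx_id, int enable)
{
	struct i915_context_param p = {
		.ctx_id = ctx_id,
		.size = 0,
		.param = I915_CONTEXT_PARAM_DATA_PORT_COHERENCY,
		.value = enable ? 1 : 0,
	};
	return p;
}
```

The kern_master/kern_worker loop then becomes: issue the setparam with value 1, submit the kern_master batch, issue it with value 0, submit kern_worker, and repeat.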
v2: Fixed compilation warning.
v3: Refactored the patch to add IOCTL instead of exec flag.
v4: Renamed and documented the API flag. Used strict values.
Removed redundant GEM_WARN_ON()s. Improved conformance to the coding standard.
Introduced a macro for checking whether hardware supports the feature.
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>
Bspec: 11419
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
drivers/gpu/drm/i915/i915_drv.h | 1 +
drivers/gpu/drm/i915/i915_gem_context.c | 41 +++++++++++++++++++++++++++
drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
drivers/gpu/drm/i915/intel_lrc.c | 49 +++++++++++++++++++++++++++++++++
drivers/gpu/drm/i915/intel_lrc.h | 4 +++
include/uapi/drm/i915_drm.h | 6 ++++
6 files changed, 107 insertions(+)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 09ab124..7d4bbd5 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
#define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
#define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
+#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
#define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index b10770c..6db352e 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -711,6 +711,26 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
}
+static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
+{
+ int ret;
+
+ ret = intel_lr_context_modify_data_port_coherency(ctx, true);
+ if (!ret)
+ __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+ return ret;
+}
+
+static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
+{
+ int ret;
+
+ ret = intel_lr_context_modify_data_port_coherency(ctx, false);
+ if (!ret)
+ __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+ return ret;
+}
+
int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
@@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
+ struct drm_i915_private *dev_priv = to_i915(dev);
struct drm_i915_file_private *file_priv = file->driver_priv;
struct drm_i915_gem_context_param *args = data;
struct i915_gem_context *ctx;
@@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
case I915_CONTEXT_PARAM_PRIORITY:
args->value = ctx->sched.priority;
break;
+ case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+ if (!HAS_DATA_PORT_COHERENCY(dev_priv))
+ ret = -ENODEV;
+ else
+ args->value = i915_gem_context_is_data_port_coherent(ctx);
+ break;
default:
ret = -EINVAL;
break;
@@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
+ struct drm_i915_private *dev_priv = to_i915(dev);
struct drm_i915_file_private *file_priv = file->driver_priv;
struct drm_i915_gem_context_param *args = data;
struct i915_gem_context *ctx;
@@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
}
break;
+ case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+ if (args->size)
+ ret = -EINVAL;
+ else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
+ ret = -ENODEV;
+ else if (args->value == 1)
+ ret = i915_gem_context_set_data_port_coherent(ctx);
+ else if (args->value == 0)
+ ret = i915_gem_context_clear_data_port_coherent(ctx);
+ else
+ ret = -EINVAL;
+ break;
+
default:
ret = -EINVAL;
break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index b116e49..e8ccb70 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -126,6 +126,7 @@ struct i915_gem_context {
#define CONTEXT_BANNABLE 3
#define CONTEXT_BANNED 4
#define CONTEXT_FORCE_SINGLE_SUBMISSION 5
+#define CONTEXT_DATA_PORT_COHERENT 6
/**
* @hw_id: - unique identifier for the context
@@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
}
+static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
+{
+ return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
+}
+
static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
{
return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ab89dab..1f037e3 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
ce->lrc_desc = desc;
}
+static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
+{
+ u32 *cs;
+ i915_reg_t reg;
+
+ GEM_BUG_ON(req->engine->class != RENDER_CLASS);
+ GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
+
+ cs = intel_ring_begin(req, 4);
+ if (IS_ERR(cs))
+ return PTR_ERR(cs);
+
+ if (INTEL_GEN(req->i915) >= 10)
+ reg = CNL_HDC_CHICKEN0;
+ else
+ reg = HDC_CHICKEN0;
+
+ *cs++ = MI_LOAD_REGISTER_IMM(1);
+ *cs++ = i915_mmio_reg_offset(reg);
+ /* Enabling coherency means disabling the bit which forces it off */
+ if (enable)
+ *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+ else
+ *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+ *cs++ = MI_NOOP;
+
+ intel_ring_advance(req, cs);
+
+ return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+ bool enable)
+{
+ struct i915_request *req;
+ int ret;
+
+ req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+
+ ret = emit_set_data_port_coherency(req, enable);
+
+ i915_request_add(req);
+
+ return ret;
+}
+
static struct i915_priolist *
lookup_priolist(struct intel_engine_cs *engine, int prio)
{
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 1593194..f6965ae 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -104,4 +104,8 @@ struct i915_gem_context;
void intel_lr_context_resume(struct drm_i915_private *dev_priv);
+int
+intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
+ bool enable);
+
#endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..e677bea 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
#define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
#define I915_CONTEXT_DEFAULT_PRIORITY 0
#define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
+/*
+ * When data port level coherency is enabled, the GPU will update memory
+ * buffers shared with CPU, by forcing internal cache units to send memory
+ * writes to real RAM faster. Keeping such coherency has performance cost.
+ */
+#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
__u64 value;
};
--
2.7.4
* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-09 13:20 ` [PATCH v4] " Tomasz Lis
@ 2018-07-09 13:48 ` Lionel Landwerlin
2018-07-09 14:03 ` Lis, Tomasz
2018-07-09 16:28 ` Tvrtko Ursulin
1 sibling, 1 reply; 88+ messages in thread
From: Lionel Landwerlin @ 2018-07-09 13:48 UTC (permalink / raw)
To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski
On 09/07/18 14:20, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
>
> Rationale:
>
> The OpenCL driver developers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic questions explaining background
> of the functionality and reasoning for the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
> ________________
> | NODE1 |
> | uint64_t data |
> +----------------|
> | NODE* | NODE*|
> +--------+-------+
> / \
> ________________/ \________________
> | NODE2 | | NODE3 |
> | uint64_t data | | uint64_t data |
> +----------------| +----------------|
> | NODE* | NODE*| | NODE* | NODE*|
> +--------+-------+ +--------+-------+
>
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
>
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to work around this (e.g. based on allocations map and
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
>
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
>
> 3. Will coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
> Removed redundant GEM_WARN_ON()s. Improved conformance to the coding standard.
> Introduced a macro for checking whether hardware supports the feature.
>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
>
> Bspec: 11419
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.h | 1 +
> drivers/gpu/drm/i915/i915_gem_context.c | 41 +++++++++++++++++++++++++++
> drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
> drivers/gpu/drm/i915/intel_lrc.c | 49 +++++++++++++++++++++++++++++++++
> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
> include/uapi/drm/i915_drm.h | 6 ++++
> 6 files changed, 107 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 09ab124..7d4bbd5 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
Reading the documentation it seems that the bit you want to set is gone
in ICL/Gen11.
Maybe limit this to >= 9 && < 11?
Cheers,
-
Lionel
>
> #define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index b10770c..6db352e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -711,6 +711,26 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
> return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
> }
>
> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + int ret;
> +
> + ret = intel_lr_context_modify_data_port_coherency(ctx, true);
> + if (!ret)
> + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> + return ret;
> +}
> +
> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + int ret;
> +
> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> + if (!ret)
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> + return ret;
> +}
> +
> int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *dev_priv = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_PRIORITY:
> args->value = ctx->sched.priority;
> break;
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> + ret = -ENODEV;
> + else
> + args->value = i915_gem_context_is_data_port_coherent(ctx);
> + break;
> default:
> ret = -EINVAL;
> break;
> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *dev_priv = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> }
> break;
>
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (args->size)
> + ret = -EINVAL;
> + else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> + ret = -ENODEV;
> + else if (args->value == 1)
> + ret = i915_gem_context_set_data_port_coherent(ctx);
> + else if (args->value == 0)
> + ret = i915_gem_context_clear_data_port_coherent(ctx);
> + else
> + ret = -EINVAL;
> + break;
> +
> default:
> ret = -EINVAL;
> break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index b116e49..e8ccb70 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -126,6 +126,7 @@ struct i915_gem_context {
> #define CONTEXT_BANNABLE 3
> #define CONTEXT_BANNED 4
> #define CONTEXT_FORCE_SINGLE_SUBMISSION 5
> +#define CONTEXT_DATA_PORT_COHERENT 6
>
> /**
> * @hw_id: - unique identifier for the context
> @@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
> __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
> }
>
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +}
> +
> static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
> {
> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ab89dab..1f037e3 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
> ce->lrc_desc = desc;
> }
>
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
> +{
> + u32 *cs;
> + i915_reg_t reg;
> +
> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> + cs = intel_ring_begin(req, 4);
> + if (IS_ERR(cs))
> + return PTR_ERR(cs);
> +
> + if (INTEL_GEN(req->i915) >= 10)
> + reg = CNL_HDC_CHICKEN0;
> + else
> + reg = HDC_CHICKEN0;
> +
> + *cs++ = MI_LOAD_REGISTER_IMM(1);
> + *cs++ = i915_mmio_reg_offset(reg);
> + /* Enabling coherency means disabling the bit which forces it off */
> + if (enable)
> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> + else
> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> + *cs++ = MI_NOOP;
> +
> + intel_ring_advance(req, cs);
> +
> + return 0;
> +}
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> + bool enable)
> +{
> + struct i915_request *req;
> + int ret;
> +
> + req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
> + if (IS_ERR(req))
> + return PTR_ERR(req);
> +
> + ret = emit_set_data_port_coherency(req, enable);
> +
> + i915_request_add(req);
> +
> + return ret;
> +}
> +
> static struct i915_priolist *
> lookup_priolist(struct intel_engine_cs *engine, int prio)
> {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..f6965ae 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>
> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> + bool enable);
> +
> #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..e677bea 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
> #define I915_CONTEXT_DEFAULT_PRIORITY 0
> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU will update memory
> + * buffers shared with CPU, by forcing internal cache units to send memory
> + * writes to real RAM faster. Keeping such coherency has performance cost.
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
> __u64 value;
> };
>
* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-09 13:48 ` Lionel Landwerlin
@ 2018-07-09 14:03 ` Lis, Tomasz
2018-07-09 14:24 ` Lionel Landwerlin
0 siblings, 1 reply; 88+ messages in thread
From: Lis, Tomasz @ 2018-07-09 14:03 UTC (permalink / raw)
To: Lionel Landwerlin, intel-gfx; +Cc: bartosz.dunajski
On 2018-07-09 15:48, Lionel Landwerlin wrote:
> On 09/07/18 14:20, Tomasz Lis wrote:
>> The patch adds a parameter to control the data port coherency
>> functionality
>> on a per-context level. When the IOCTL is called, a command to switch
>> data
>> port coherency state is added to the ordered list. All prior requests
>> are
>> executed on old coherency settings, and all exec requests after the
>> IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver developers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is
>> disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic question explaining background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that
>> is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature
>> that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN
>> architecture, adds
>> overhead related to tracking (snooping) memory inside different cache
>> units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence
>> require
>> memory coherency between CPU and GPU). The goal of coherency
>> enable/disable
>> switch is to remove overhead of memory coherency when memory
>> coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units),
>> Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send"
>> instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O
>> using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This
>> "stateless"
>> model is similar to regular memory load/store operations available on
>> typical
>> CPUs. Since this model provides I/O using arbitrary virtual
>> addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer
>> (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>> ________________
>> | NODE1 |
>> | uint64_t data |
>> +----------------|
>> | NODE* | NODE*|
>> +--------+-------+
>> / \
>> ________________/ \________________
>> | NODE2 | | NODE3 |
>> | uint64_t data | | uint64_t data |
>> +----------------| +----------------|
>> | NODE* | NODE*| | NODE* | NODE*|
>> +--------+-------+ +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory
>> locations
>> in different OCL allocations - e.g. NODE1 and NODE2 can reside in
>> one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM -
>> Shared
>> Virtual Memory feature). Using pointers from different allocations
>> doesn't
>> affect the stateless addressing model which even allows scattered
>> reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature
>> of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in
>> stateless model
>> can be encoded (at ISA level) to either use or disable coherency.
>> However, for
>> generic OCL applications (such as example with tree-like data
>> structure), OCL
>> compiler is not able to determine origin of memory pointed to by an
>> arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is
>> needed or not
>> for specific pointer (or for specific I/O instruction). As a result,
>> compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that
>> it would be
>> possible to work around this (e.g. based on allocations map and
>> pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping
>> coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN
>> ISA level
>> is not feasible and alternative method is needed.
>>
>> Such alternative solution is to have a global coherency switch that
>> allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that
>> actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance
>> impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and
>> kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes
>> incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain)
>> with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and
>> processes
>> the payload that kern_master produced. These two kernels work in a
>> loop, one
>> after another. Since only kern_master requires coherency, kern_worker
>> should
>> not be forced to pay for it. This means that we need to have the
>> ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker ->
>> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> v2: Fixed compilation warning.
>> v3: Refactored the patch to add IOCTL instead of exec flag.
>> v4: Renamed and documented the API flag. Used strict values.
>> Removed redundant GEM_WARN_ON()s. Improved conformance to the coding
>> Introduced a macro for checking whether hardware supports the
>> feature.
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>
>> Bspec: 11419
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_drv.h | 1 +
>> drivers/gpu/drm/i915/i915_gem_context.c | 41
>> +++++++++++++++++++++++++++
>> drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
>> drivers/gpu/drm/i915/intel_lrc.c | 49
>> +++++++++++++++++++++++++++++++++
>> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
>> include/uapi/drm/i915_drm.h | 6 ++++
>> 6 files changed, 107 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index 09ab124..7d4bbd5 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private
>> *dev_priv)
>> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap &
>> EDRAM_ENABLED))
>> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
>> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>
> Reading the documentation it seems that the bit you want to set is
> gone in ICL/Gen11.
> Maybe limit this to >= 9 && < 11?
Icelake actually has the bit as well, just the address is different.
I will add its support as a separate patch as soon as the change which
defines ICL_HDC_CHICKEN0 is accepted.
But in the current form - you are right, ICL is not supported.
I will update the condition.
-Tomasz
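The narrowed condition being discussed would presumably bound the gen check on both sides. A minimal sketch, with a plain int standing in for `INTEL_GEN(dev_priv)` (this is not kernel code): Gen9/Gen10 keep the HDC chicken bit at the programmed offsets, while Gen11 (ICL) moves it, so ICL stays excluded until ICL_HDC_CHICKEN0 support lands.

```c
#include <assert.h>

/* Stand-in for the HAS_DATA_PORT_COHERENCY() macro with the proposed
 * upper bound; "gen" replaces INTEL_GEN(dev_priv) for illustration. */
int has_data_port_coherency(int gen)
{
	return gen >= 9 && gen < 11;
}
```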
>
> Cheers,
>
> -
> Lionel
>
>> #define HWS_NEEDS_PHYSICAL(dev_priv)
>> ((dev_priv)->info.hws_needs_physical)
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index b10770c..6db352e 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct
>> drm_i915_file_private *file_priv)
>> return atomic_read(&file_priv->ban_score) >=
>> I915_CLIENT_SCORE_BANNED;
>> }
>> +static int i915_gem_context_set_data_port_coherent(struct
>> i915_gem_context *ctx)
>> +{
>> + int ret;
>> +
>> + ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>> + if (!ret)
>> + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> + return ret;
>> +}
>> +
>> +static int i915_gem_context_clear_data_port_coherent(struct
>> i915_gem_context *ctx)
>> +{
>> + int ret;
>> +
>> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>> + if (!ret)
>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> + return ret;
>> +}
>> +
>> int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>> struct drm_file *file)
>> {
>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct
>> drm_device *dev, void *data,
>> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void
>> *data,
>> struct drm_file *file)
>> {
>> + struct drm_i915_private *dev_priv = to_i915(dev);
>> struct drm_i915_file_private *file_priv = file->driver_priv;
>> struct drm_i915_gem_context_param *args = data;
>> struct i915_gem_context *ctx;
>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct
>> drm_device *dev, void *data,
>> case I915_CONTEXT_PARAM_PRIORITY:
>> args->value = ctx->sched.priority;
>> break;
>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> + if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> + ret = -ENODEV;
>> + else
>> + args->value = i915_gem_context_is_data_port_coherent(ctx);
>> + break;
>> default:
>> ret = -EINVAL;
>> break;
>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct
>> drm_device *dev, void *data,
>> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void
>> *data,
>> struct drm_file *file)
>> {
>> + struct drm_i915_private *dev_priv = to_i915(dev);
>> struct drm_i915_file_private *file_priv = file->driver_priv;
>> struct drm_i915_gem_context_param *args = data;
>> struct i915_gem_context *ctx;
>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct
>> drm_device *dev, void *data,
>> }
>> break;
>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> + if (args->size)
>> + ret = -EINVAL;
>> + else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> + ret = -ENODEV;
>> + else if (args->value == 1)
>> + ret = i915_gem_context_set_data_port_coherent(ctx);
>> + else if (args->value == 0)
>> + ret = i915_gem_context_clear_data_port_coherent(ctx);
>> + else
>> + ret = -EINVAL;
>> + break;
>> +
>> default:
>> ret = -EINVAL;
>> break;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h
>> b/drivers/gpu/drm/i915/i915_gem_context.h
>> index b116e49..e8ccb70 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>> #define CONTEXT_BANNABLE 3
>> #define CONTEXT_BANNED 4
>> #define CONTEXT_FORCE_SINGLE_SUBMISSION 5
>> +#define CONTEXT_DATA_PORT_COHERENT 6
>> /**
>> * @hw_id: - unique identifier for the context
>> @@ -257,6 +258,11 @@ static inline void
>> i915_gem_context_set_force_single_submission(struct i915_gem_
>> __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>> }
>> +static inline bool i915_gem_context_is_data_port_coherent(struct
>> i915_gem_context *ctx)
>> +{
>> + return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +}
>> +
>> static inline bool i915_gem_context_is_default(const struct
>> i915_gem_context *c)
>> {
>> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c
>> b/drivers/gpu/drm/i915/intel_lrc.c
>> index ab89dab..1f037e3 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct
>> i915_gem_context *ctx,
>> ce->lrc_desc = desc;
>> }
>> +static int emit_set_data_port_coherency(struct i915_request *req,
>> bool enable)
>> +{
>> + u32 *cs;
>> + i915_reg_t reg;
>> +
>> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> + cs = intel_ring_begin(req, 4);
>> + if (IS_ERR(cs))
>> + return PTR_ERR(cs);
>> +
>> + if (INTEL_GEN(req->i915) >= 10)
>> + reg = CNL_HDC_CHICKEN0;
>> + else
>> + reg = HDC_CHICKEN0;
>> +
>> + *cs++ = MI_LOAD_REGISTER_IMM(1);
>> + *cs++ = i915_mmio_reg_offset(reg);
>> + /* Enabling coherency means disabling the bit which forces it
>> off */
>> + if (enable)
>> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> + else
>> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> + *cs++ = MI_NOOP;
>> +
>> + intel_ring_advance(req, cs);
>> +
>> + return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context
>> *ctx,
>> + bool enable)
>> +{
>> + struct i915_request *req;
>> + int ret;
>> +
>> + req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>> + if (IS_ERR(req))
>> + return PTR_ERR(req);
>> +
>> + ret = emit_set_data_port_coherency(req, enable);
>> +
>> + i915_request_add(req);
>> +
>> + return ret;
>> +}
>> +
>> static struct i915_priolist *
>> lookup_priolist(struct intel_engine_cs *engine, int prio)
>> {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h
>> b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..f6965ae 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context
>> *ctx,
>> + bool enable);
>> +
>> #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..e677bea 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
>> #define I915_CONTEXT_DEFAULT_PRIORITY 0
>> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
>> +/*
>> + * When data port level coherency is enabled, the GPU will update
>> memory
>> + * buffers shared with CPU, by forcing internal cache units to send
>> memory
>> + * writes to real RAM faster. Keeping such coherency has performance
>> cost.
>> + */
>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
>> __u64 value;
>> };
>
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-09 14:03 ` Lis, Tomasz
@ 2018-07-09 14:24 ` Lionel Landwerlin
2018-07-09 15:21 ` Lis, Tomasz
0 siblings, 1 reply; 88+ messages in thread
From: Lionel Landwerlin @ 2018-07-09 14:24 UTC (permalink / raw)
To: Lis, Tomasz, intel-gfx; +Cc: bartosz.dunajski
On 09/07/18 15:03, Lis, Tomasz wrote:
>
>
> On 2018-07-09 15:48, Lionel Landwerlin wrote:
>> On 09/07/18 14:20, Tomasz Lis wrote:
>>> The patch adds a parameter to control the data port coherency
>>> functionality
>>> on a per-context level. When the IOCTL is called, a command to
>>> switch data
>>> port coherency state is added to the ordered list. All prior
>>> requests are
>>> executed on old coherency settings, and all exec requests after the
>>> IOCTL
>>> will use new settings.
>>>
>>> Rationale:
>>>
>>> The OpenCL driver developers requested functionality to control cache
>>> coherency at the data port level. Coherency at that level is
>>> disabled
>>> by default due to its performance cost. The OpenCL driver is planning to
>>> enable it for a small subset of submissions, when such functionality is
>>> required. Below are answers to basic questions explaining the background
>>> of the functionality and the reasoning for the proposed implementation:
>>>
>>> 1. Why do we need a coherency enable/disable switch for memory that
>>> is shared
>>> between CPU and GEN (GPU)?
>>>
>>> Memory coherency between CPU and GEN, while being a great feature
>>> that enables
>>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN
>>> architecture, adds
>>> overhead related to tracking (snooping) memory inside different
>>> cache units
>>> (L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
>>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence
>>> require
>>> memory coherency between CPU and GPU). The goal of coherency
>>> enable/disable
>>> switch is to remove overhead of memory coherency when memory
>>> coherency is not
>>> needed.
>>>
>>> 2. Why do we need a global coherency switch?
>>>
>>> In order to support I/O commands from within EUs (Execution Units),
>>> Intel GEN
>>> ISA (GEN Instruction Set Assembly) contains dedicated "send"
>>> instructions.
>>> These send instructions provide several addressing models. One of these
>>> addressing models (named "stateless") provides the most flexible I/O
>>> using plain
>>> virtual addresses (as opposed to buffer_handle+offset models). This
>>> "stateless"
>>> model is similar to regular memory load/store operations available
>>> on typical
>>> CPUs. Since this model provides I/O using arbitrary virtual
>>> addresses, it
>>> enables algorithmic designs that are based on pointer-to-pointer
>>> (e.g. buffer
>>> of pointers) concepts. For instance, it allows creating tree-like data
>>> structures such as:
>>> ________________
>>> | NODE1 |
>>> | uint64_t data |
>>> +----------------|
>>> | NODE* | NODE*|
>>> +--------+-------+
>>> / \
>>> ________________/ \________________
>>> | NODE2 | | NODE3 |
>>> | uint64_t data | | uint64_t data |
>>> +----------------| +----------------|
>>> | NODE* | NODE*| | NODE* | NODE*|
>>> +--------+-------+ +--------+-------+
>>>
>>> Please note that pointers inside such structures can point to memory
>>> locations
>>> in different OCL allocations - e.g. NODE1 and NODE2 can reside in
>>> one OCL
>>> allocation while NODE3 resides in a completely separate OCL allocation.
>>> Additionally, such pointers can be shared with CPU (i.e. using SVM -
>>> Shared
>>> Virtual Memory feature). Using pointers from different allocations
>>> doesn't
>>> affect the stateless addressing model which even allows scattered
>>> reading from
>>> different allocations at the same time (i.e. by utilizing
>>> SIMD-nature of send
>>> instructions).
>>>
>>> When it comes to coherency programming, send instructions in
>>> stateless model
>>> can be encoded (at ISA level) to either use or disable coherency.
>>> However, for
>>> generic OCL applications (such as example with tree-like data
>>> structure), OCL
>>> compiler is not able to determine origin of memory pointed to by an
>>> arbitrary
>>> pointer - i.e. is not able to track given pointer back to a specific
>>> allocation. As such, it's not able to decide whether coherency is
>>> needed or not
>>> for specific pointer (or for specific I/O instruction). As a result,
>>> compiler
>>> encodes all stateless sends as coherent (doing otherwise would lead to
>>> functional issues resulting from data corruption). Please note that
>>> it would be
>>> possible to work around this (e.g. based on an allocations map and
>>> pointer bounds
>>> checking prior to each I/O instruction) but the performance cost of
>>> such
>>> workaround would be many times greater than the cost of keeping
>>> coherency
>>> always enabled. As such, enabling/disabling memory coherency at GEN
>>> ISA level
>>> is not feasible and an alternative method is needed.
>>>
>>> Such an alternative solution is to have a global coherency switch that
>>> allows
>>> disabling coherency for a single (though entire) GPU submission. This is
>>> beneficial because this way we:
>>> * can enable (and pay for) coherency only in submissions that
>>> actually need
>>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>>> * don't care about coherency at GEN ISA granularity (no performance
>>> impact)
>>>
>>> 3. Will coherency switch be used frequently?
>>>
>>> There are scenarios that will require frequent toggling of the
>>> coherency
>>> switch.
>>> E.g. an application has two OCL compute kernels: kern_master and
>>> kern_worker.
>>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>>> computational work that needs to be executed. kern_master analyzes
>>> incoming
>>> work descriptors and populates a plain OCL buffer (non-fine-grain)
>>> with payload
>>> for kern_worker. Once kern_master is done, kern_worker kicks in and
>>> processes
>>> the payload that kern_master produced. These two kernels work in a
>>> loop, one
>>> after another. Since only kern_master requires coherency,
>>> kern_worker should
>>> not be forced to pay for it. This means that we need to have the
>>> ability to
>>> toggle the coherency switch on or off for each GPU submission:
>>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker ->
>>> (ENABLE
>>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>>
>>> v2: Fixed compilation warning.
>>> v3: Refactored the patch to add IOCTL instead of exec flag.
>>> v4: Renamed and documented the API flag. Used strict values.
>>> Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>> Introduced a macro for checking whether hardware supports the
>>> feature.
>>>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>>
>>> Bspec: 11419
>>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>>> ---
>>> drivers/gpu/drm/i915/i915_drv.h | 1 +
>>> drivers/gpu/drm/i915/i915_gem_context.c | 41
>>> +++++++++++++++++++++++++++
>>> drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
>>> drivers/gpu/drm/i915/intel_lrc.c | 49
>>> +++++++++++++++++++++++++++++++++
>>> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
>>> include/uapi/drm/i915_drm.h | 6 ++++
>>> 6 files changed, 107 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h
>>> b/drivers/gpu/drm/i915/i915_drv.h
>>> index 09ab124..7d4bbd5 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private
>>> *dev_priv)
>>> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap &
>>> EDRAM_ENABLED))
>>> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
>>> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>
>> Reading the documentation it seems that the bit you want to set is
>> gone in ICL/Gen11.
>> Maybe limit this to >= 9 && < 11?
> Icelake actually has the bit as well, just the address is different.
> I will add its support as a separate patch as soon as the change which
> defines ICL_HDC_CHICKEN0 is accepted.
> But in the current form - you are right, ICL is not supported.
> I will update the condition.
> -Tomasz
Just out of curiosity, what address is ICL_HDC_CHICKEN0 at?
Thanks,
-
Lionel
>>
>> Cheers,
>>
>> -
>> Lionel
>>
>>> #define HWS_NEEDS_PHYSICAL(dev_priv)
>>> ((dev_priv)->info.hws_needs_physical)
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
>>> b/drivers/gpu/drm/i915/i915_gem_context.c
>>> index b10770c..6db352e 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct
>>> drm_i915_file_private *file_priv)
>>> return atomic_read(&file_priv->ban_score) >=
>>> I915_CLIENT_SCORE_BANNED;
>>> }
>>> +static int i915_gem_context_set_data_port_coherent(struct
>>> i915_gem_context *ctx)
>>> +{
>>> + int ret;
>>> +
>>> + ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>>> + if (!ret)
>>> + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> + return ret;
>>> +}
>>> +
>>> +static int i915_gem_context_clear_data_port_coherent(struct
>>> i915_gem_context *ctx)
>>> +{
>>> + int ret;
>>> +
>>> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>> + if (!ret)
>>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> + return ret;
>>> +}
>>> +
>>> int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>> struct drm_file *file)
>>> {
>>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct
>>> drm_device *dev, void *data,
>>> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void
>>> *data,
>>> struct drm_file *file)
>>> {
>>> + struct drm_i915_private *dev_priv = to_i915(dev);
>>> struct drm_i915_file_private *file_priv = file->driver_priv;
>>> struct drm_i915_gem_context_param *args = data;
>>> struct i915_gem_context *ctx;
>>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct
>>> drm_device *dev, void *data,
>>> case I915_CONTEXT_PARAM_PRIORITY:
>>> args->value = ctx->sched.priority;
>>> break;
>>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>> + if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>> + ret = -ENODEV;
>>> + else
>>> + args->value = i915_gem_context_is_data_port_coherent(ctx);
>>> + break;
>>> default:
>>> ret = -EINVAL;
>>> break;
>>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct
>>> drm_device *dev, void *data,
>>> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void
>>> *data,
>>> struct drm_file *file)
>>> {
>>> + struct drm_i915_private *dev_priv = to_i915(dev);
>>> struct drm_i915_file_private *file_priv = file->driver_priv;
>>> struct drm_i915_gem_context_param *args = data;
>>> struct i915_gem_context *ctx;
>>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct
>>> drm_device *dev, void *data,
>>> }
>>> break;
>>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>> + if (args->size)
>>> + ret = -EINVAL;
>>> + else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>> + ret = -ENODEV;
>>> + else if (args->value == 1)
>>> + ret = i915_gem_context_set_data_port_coherent(ctx);
>>> + else if (args->value == 0)
>>> + ret = i915_gem_context_clear_data_port_coherent(ctx);
>>> + else
>>> + ret = -EINVAL;
>>> + break;
>>> +
>>> default:
>>> ret = -EINVAL;
>>> break;
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h
>>> b/drivers/gpu/drm/i915/i915_gem_context.h
>>> index b116e49..e8ccb70 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>>> #define CONTEXT_BANNABLE 3
>>> #define CONTEXT_BANNED 4
>>> #define CONTEXT_FORCE_SINGLE_SUBMISSION 5
>>> +#define CONTEXT_DATA_PORT_COHERENT 6
>>> /**
>>> * @hw_id: - unique identifier for the context
>>> @@ -257,6 +258,11 @@ static inline void
>>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>> __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>> }
>>> +static inline bool i915_gem_context_is_data_port_coherent(struct
>>> i915_gem_context *ctx)
>>> +{
>>> + return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> +}
>>> +
>>> static inline bool i915_gem_context_is_default(const struct
>>> i915_gem_context *c)
>>> {
>>> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c
>>> b/drivers/gpu/drm/i915/intel_lrc.c
>>> index ab89dab..1f037e3 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct
>>> i915_gem_context *ctx,
>>> ce->lrc_desc = desc;
>>> }
>>> +static int emit_set_data_port_coherency(struct i915_request *req,
>>> bool enable)
>>> +{
>>> + u32 *cs;
>>> + i915_reg_t reg;
>>> +
>>> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>>> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>>> +
>>> + cs = intel_ring_begin(req, 4);
>>> + if (IS_ERR(cs))
>>> + return PTR_ERR(cs);
>>> +
>>> + if (INTEL_GEN(req->i915) >= 10)
>>> + reg = CNL_HDC_CHICKEN0;
>>> + else
>>> + reg = HDC_CHICKEN0;
>>> +
>>> + *cs++ = MI_LOAD_REGISTER_IMM(1);
>>> + *cs++ = i915_mmio_reg_offset(reg);
>>> + /* Enabling coherency means disabling the bit which forces it
>>> off */
>>> + if (enable)
>>> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>>> + else
>>> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>>> + *cs++ = MI_NOOP;
>>> +
>>> + intel_ring_advance(req, cs);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +int
>>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context
>>> *ctx,
>>> + bool enable)
>>> +{
>>> + struct i915_request *req;
>>> + int ret;
>>> +
>>> + req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>>> + if (IS_ERR(req))
>>> + return PTR_ERR(req);
>>> +
>>> + ret = emit_set_data_port_coherency(req, enable);
>>> +
>>> + i915_request_add(req);
>>> +
>>> + return ret;
>>> +}
>>> +
>>> static struct i915_priolist *
>>> lookup_priolist(struct intel_engine_cs *engine, int prio)
>>> {
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h
>>> b/drivers/gpu/drm/i915/intel_lrc.h
>>> index 1593194..f6965ae 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>> +int
>>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context
>>> *ctx,
>>> + bool enable);
>>> +
>>> #endif /* _INTEL_LRC_H_ */
>>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>>> index 7f5634c..e677bea 100644
>>> --- a/include/uapi/drm/i915_drm.h
>>> +++ b/include/uapi/drm/i915_drm.h
>>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>>> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
>>> #define I915_CONTEXT_DEFAULT_PRIORITY 0
>>> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
>>> +/*
>>> + * When data port level coherency is enabled, the GPU will update
>>> memory
>>> + * buffers shared with CPU, by forcing internal cache units to send
>>> memory
>>> + * writes to real RAM faster. Keeping such coherency has
>>> performance cost.
>>> + */
>>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
>>> __u64 value;
>>> };
>>
>>
>
>
* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-09 14:24 ` Lionel Landwerlin
@ 2018-07-09 15:21 ` Lis, Tomasz
0 siblings, 0 replies; 88+ messages in thread
From: Lis, Tomasz @ 2018-07-09 15:21 UTC (permalink / raw)
To: Lionel Landwerlin, intel-gfx; +Cc: bartosz.dunajski
On 2018-07-09 16:24, Lionel Landwerlin wrote:
> On 09/07/18 15:03, Lis, Tomasz wrote:
>>
>>
>> On 2018-07-09 15:48, Lionel Landwerlin wrote:
>>> On 09/07/18 14:20, Tomasz Lis wrote:
>>>> The patch adds a parameter to control the data port coherency
>>>> functionality
>>>> on a per-context level. When the IOCTL is called, a command to
>>>> switch data
>>>> port coherency state is added to the ordered list. All prior
>>>> requests are
>>>> executed on old coherency settings, and all exec requests after the
>>>> IOCTL
>>>> will use new settings.
>>>>
>>>> Rationale:
>>>>
>>>> The OpenCL driver developers requested functionality to control cache
>>>> coherency at the data port level. Coherency at that level
>>>> is disabled
>>>> by default due to its performance cost. The OpenCL driver is planning to
>>>> enable it for a small subset of submissions, when such
>>>> functionality is
>>>> required. Below are answers to basic questions explaining the background
>>>> of the functionality and the reasoning for the proposed implementation:
>>>>
>>>> 1. Why do we need a coherency enable/disable switch for memory that
>>>> is shared
>>>> between CPU and GEN (GPU)?
>>>>
>>>> Memory coherency between CPU and GEN, while being a great feature
>>>> that enables
>>>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN
>>>> architecture, adds
>>>> overhead related to tracking (snooping) memory inside different
>>>> cache units
>>>> (L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
>>>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence
>>>> require
>>>> memory coherency between CPU and GPU). The goal of coherency
>>>> enable/disable
>>>> switch is to remove overhead of memory coherency when memory
>>>> coherency is not
>>>> needed.
>>>>
>>>> 2. Why do we need a global coherency switch?
>>>>
>>>> In order to support I/O commands from within EUs (Execution Units),
>>>> Intel GEN
>>>> ISA (GEN Instruction Set Assembly) contains dedicated "send"
>>>> instructions.
>>>> These send instructions provide several addressing models. One of
>>>> these
>>>> addressing models (named "stateless") provides the most flexible I/O
>>>> using plain
>>>> virtual addresses (as opposed to buffer_handle+offset models). This
>>>> "stateless"
>>>> model is similar to regular memory load/store operations available
>>>> on typical
>>>> CPUs. Since this model provides I/O using arbitrary virtual
>>>> addresses, it
>>>> enables algorithmic designs that are based on pointer-to-pointer
>>>> (e.g. buffer
>>>> of pointers) concepts. For instance, it allows creating tree-like data
>>>> structures such as:
>>>> ________________
>>>> | NODE1 |
>>>> | uint64_t data |
>>>> +----------------|
>>>> | NODE* | NODE*|
>>>> +--------+-------+
>>>> / \
>>>> ________________/ \________________
>>>> | NODE2 | | NODE3 |
>>>> | uint64_t data | | uint64_t data |
>>>> +----------------| +----------------|
>>>> | NODE* | NODE*| | NODE* | NODE*|
>>>> +--------+-------+ +--------+-------+
>>>>
>>>> Please note that pointers inside such structures can point to
>>>> memory locations
>>>> in different OCL allocations - e.g. NODE1 and NODE2 can reside in
>>>> one OCL
>>>> allocation while NODE3 resides in a completely separate OCL
>>>> allocation.
>>>> Additionally, such pointers can be shared with CPU (i.e. using SVM
>>>> - Shared
>>>> Virtual Memory feature). Using pointers from different allocations
>>>> doesn't
>>>> affect the stateless addressing model which even allows scattered
>>>> reading from
>>>> different allocations at the same time (i.e. by utilizing
>>>> SIMD-nature of send
>>>> instructions).
>>>>
>>>> When it comes to coherency programming, send instructions in
>>>> stateless model
>>>> can be encoded (at ISA level) to either use or disable coherency.
>>>> However, for
>>>> generic OCL applications (such as example with tree-like data
>>>> structure), OCL
>>>> compiler is not able to determine origin of memory pointed to by an
>>>> arbitrary
>>>> pointer - i.e. is not able to track given pointer back to a specific
>>>> allocation. As such, it's not able to decide whether coherency is
>>>> needed or not
>>>> for specific pointer (or for specific I/O instruction). As a
>>>> result, compiler
>>>> encodes all stateless sends as coherent (doing otherwise would lead to
>>>> functional issues resulting from data corruption). Please note that
>>>> it would be
>>>> possible to work around this (e.g. based on an allocations map and
>>>> pointer bounds
>>>> checking prior to each I/O instruction) but the performance cost of
>>>> such
>>>> workaround would be many times greater than the cost of keeping
>>>> coherency
>>>> always enabled. As such, enabling/disabling memory coherency at GEN
>>>> ISA level
>>>> is not feasible and an alternative method is needed.
>>>>
>>>> Such an alternative solution is to have a global coherency switch that
>>>> allows
>>>> disabling coherency for a single (though entire) GPU submission. This is
>>>> beneficial because this way we:
>>>> * can enable (and pay for) coherency only in submissions that
>>>> actually need
>>>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER
>>>> resources)
>>>> * don't care about coherency at GEN ISA granularity (no performance
>>>> impact)
>>>>
>>>> 3. Will coherency switch be used frequently?
>>>>
>>>> There are scenarios that will require frequent toggling of the
>>>> coherency
>>>> switch.
>>>> E.g. an application has two OCL compute kernels: kern_master and
>>>> kern_worker.
>>>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>>>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>>>> computational work that needs to be executed. kern_master analyzes
>>>> incoming
>>>> work descriptors and populates a plain OCL buffer (non-fine-grain)
>>>> with payload
>>>> for kern_worker. Once kern_master is done, kern_worker kicks in and
>>>> processes
>>>> the payload that kern_master produced. These two kernels work in a
>>>> loop, one
>>>> after another. Since only kern_master requires coherency,
>>>> kern_worker should
>>>> not be forced to pay for it. This means that we need to have the
>>>> ability to
>>>> toggle the coherency switch on or off for each GPU submission:
>>>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker ->
>>>> (ENABLE
>>>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>>>
>>>> v2: Fixed compilation warning.
>>>> v3: Refactored the patch to add IOCTL instead of exec flag.
>>>> v4: Renamed and documented the API flag. Used strict values.
>>>> Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>>>> Introduced a macro for checking whether hardware supports the
>>>> feature.
>>>>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>>>
>>>> Bspec: 11419
>>>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>>>> ---
>>>> drivers/gpu/drm/i915/i915_drv.h | 1 +
>>>> drivers/gpu/drm/i915/i915_gem_context.c | 41
>>>> +++++++++++++++++++++++++++
>>>> drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
>>>> drivers/gpu/drm/i915/intel_lrc.c | 49
>>>> +++++++++++++++++++++++++++++++++
>>>> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
>>>> include/uapi/drm/i915_drm.h | 6 ++++
>>>> 6 files changed, 107 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h
>>>> b/drivers/gpu/drm/i915/i915_drv.h
>>>> index 09ab124..7d4bbd5 100644
>>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private
>>>> *dev_priv)
>>>> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap &
>>>> EDRAM_ENABLED))
>>>> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
>>>> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>>>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>>>
>>> Reading the documentation it seems that the bit you want to set is
>>> gone in ICL/Gen11.
>>> Maybe limit this to >= 9 && < 11?
>> Icelake actually has the bit as well, just the address is different.
>> I will add its support as a separate patch as soon as the change
>> which defines ICL_HDC_CHICKEN0 is accepted.
>> But in the current form - you are right, ICL is not supported.
>> I will update the condition.
>> -Tomasz
>
> Just out of curiosity, what address is ICL_HDC_CHICKEN0 at?
It was defined as _MMIO(0xE5F4). But now I see it has been renamed to
ICL_HDC_MODE, and is already on the tip.
Bspec: 19175
Wow, looks like I can include the gen11 support already. Will add in
next version.
Thank you!
>
> Thanks,
>
> -
> Lionel
>
>>>
>>> Cheers,
>>>
>>> -
>>> Lionel
>>>
>>>> #define HWS_NEEDS_PHYSICAL(dev_priv)
>>>> ((dev_priv)->info.hws_needs_physical)
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
>>>> b/drivers/gpu/drm/i915/i915_gem_context.c
>>>> index b10770c..6db352e 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>>>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct
>>>> drm_i915_file_private *file_priv)
>>>> return atomic_read(&file_priv->ban_score) >=
>>>> I915_CLIENT_SCORE_BANNED;
>>>> }
>>>> +static int i915_gem_context_set_data_port_coherent(struct
>>>> i915_gem_context *ctx)
>>>> +{
>>>> + int ret;
>>>> +
>>>> + ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>>>> + if (!ret)
>>>> + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static int i915_gem_context_clear_data_port_coherent(struct
>>>> i915_gem_context *ctx)
>>>> +{
>>>> + int ret;
>>>> +
>>>> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>>> + if (!ret)
>>>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> + return ret;
>>>> +}
>>>> +
>>>> int i915_gem_context_create_ioctl(struct drm_device *dev, void
>>>> *data,
>>>> struct drm_file *file)
>>>> {
>>>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct
>>>> drm_device *dev, void *data,
>>>> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void
>>>> *data,
>>>> struct drm_file *file)
>>>> {
>>>> + struct drm_i915_private *dev_priv = to_i915(dev);
>>>> struct drm_i915_file_private *file_priv = file->driver_priv;
>>>> struct drm_i915_gem_context_param *args = data;
>>>> struct i915_gem_context *ctx;
>>>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct
>>>> drm_device *dev, void *data,
>>>> case I915_CONTEXT_PARAM_PRIORITY:
>>>> args->value = ctx->sched.priority;
>>>> break;
>>>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>>> + if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>>> + ret = -ENODEV;
>>>> + else
>>>> + args->value =
>>>> i915_gem_context_is_data_port_coherent(ctx);
>>>> + break;
>>>> default:
>>>> ret = -EINVAL;
>>>> break;
>>>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct
>>>> drm_device *dev, void *data,
>>>> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void
>>>> *data,
>>>> struct drm_file *file)
>>>> {
>>>> + struct drm_i915_private *dev_priv = to_i915(dev);
>>>> struct drm_i915_file_private *file_priv = file->driver_priv;
>>>> struct drm_i915_gem_context_param *args = data;
>>>> struct i915_gem_context *ctx;
>>>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct
>>>> drm_device *dev, void *data,
>>>> }
>>>> break;
>>>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>>>> + if (args->size)
>>>> + ret = -EINVAL;
>>>> + else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>>>> + ret = -ENODEV;
>>>> + else if (args->value == 1)
>>>> + ret = i915_gem_context_set_data_port_coherent(ctx);
>>>> + else if (args->value == 0)
>>>> + ret = i915_gem_context_clear_data_port_coherent(ctx);
>>>> + else
>>>> + ret = -EINVAL;
>>>> + break;
>>>> +
>>>> default:
>>>> ret = -EINVAL;
>>>> break;
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h
>>>> b/drivers/gpu/drm/i915/i915_gem_context.h
>>>> index b116e49..e8ccb70 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>>>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>>>> #define CONTEXT_BANNABLE 3
>>>> #define CONTEXT_BANNED 4
>>>> #define CONTEXT_FORCE_SINGLE_SUBMISSION 5
>>>> +#define CONTEXT_DATA_PORT_COHERENT 6
>>>> /**
>>>> * @hw_id: - unique identifier for the context
>>>> @@ -257,6 +258,11 @@ static inline void
>>>> i915_gem_context_set_force_single_submission(struct i915_gem_
>>>> __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>>>> }
>>>> +static inline bool i915_gem_context_is_data_port_coherent(struct
>>>> i915_gem_context *ctx)
>>>> +{
>>>> + return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> +}
>>>> +
>>>> static inline bool i915_gem_context_is_default(const struct
>>>> i915_gem_context *c)
>>>> {
>>>> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c
>>>> b/drivers/gpu/drm/i915/intel_lrc.c
>>>> index ab89dab..1f037e3 100644
>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct
>>>> i915_gem_context *ctx,
>>>> ce->lrc_desc = desc;
>>>> }
>>>> +static int emit_set_data_port_coherency(struct i915_request
>>>> *req, bool enable)
>>>> +{
>>>> + u32 *cs;
>>>> + i915_reg_t reg;
>>>> +
>>>> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>>>> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>>>> +
>>>> + cs = intel_ring_begin(req, 4);
>>>> + if (IS_ERR(cs))
>>>> + return PTR_ERR(cs);
>>>> +
>>>> + if (INTEL_GEN(req->i915) >= 10)
>>>> + reg = CNL_HDC_CHICKEN0;
>>>> + else
>>>> + reg = HDC_CHICKEN0;
>>>> +
>>>> + *cs++ = MI_LOAD_REGISTER_IMM(1);
>>>> + *cs++ = i915_mmio_reg_offset(reg);
>>>> + /* Enabling coherency means disabling the bit which forces it
>>>> off */
>>>> + if (enable)
>>>> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>>>> + else
>>>> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>>>> + *cs++ = MI_NOOP;
>>>> +
>>>> + intel_ring_advance(req, cs);
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +int
>>>> +intel_lr_context_modify_data_port_coherency(struct
>>>> i915_gem_context *ctx,
>>>> + bool enable)
>>>> +{
>>>> + struct i915_request *req;
>>>> + int ret;
>>>> +
>>>> + req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>>>> + if (IS_ERR(req))
>>>> + return PTR_ERR(req);
>>>> +
>>>> + ret = emit_set_data_port_coherency(req, enable);
>>>> +
>>>> + i915_request_add(req);
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> static struct i915_priolist *
>>>> lookup_priolist(struct intel_engine_cs *engine, int prio)
>>>> {
>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h
>>>> b/drivers/gpu/drm/i915/intel_lrc.h
>>>> index 1593194..f6965ae 100644
>>>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>>>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>>>> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>>>> +int
>>>> +intel_lr_context_modify_data_port_coherency(struct
>>>> i915_gem_context *ctx,
>>>> + bool enable);
>>>> +
>>>> #endif /* _INTEL_LRC_H_ */
>>>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>>>> index 7f5634c..e677bea 100644
>>>> --- a/include/uapi/drm/i915_drm.h
>>>> +++ b/include/uapi/drm/i915_drm.h
>>>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>>>> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
>>>> #define I915_CONTEXT_DEFAULT_PRIORITY 0
>>>> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
>>>> +/*
>>>> + * When data port level coherency is enabled, the GPU will update
>>>> memory
>>>> + * buffers shared with CPU, by forcing internal cache units to
>>>> send memory
>>>> + * writes to real RAM faster. Keeping such coherency has
>>>> performance cost.
>>>> + */
>>>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
>>>> __u64 value;
>>>> };
>>>
>>>
>>
>>
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-09 13:20 ` [PATCH v4] " Tomasz Lis
2018-07-09 13:48 ` Lionel Landwerlin
@ 2018-07-09 16:28 ` Tvrtko Ursulin
2018-07-09 16:37 ` Chris Wilson
2018-07-10 18:03 ` Lis, Tomasz
1 sibling, 2 replies; 88+ messages in thread
From: Tvrtko Ursulin @ 2018-07-09 16:28 UTC (permalink / raw)
To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski
On 09/07/2018 14:20, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
>
> Rationale:
>
> The OpenCL driver developers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic questions explaining the background
> of the functionality and reasoning for the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                     ________________
>                    |      NODE1     |
>                    |  uint64_t data |
>                    +----------------|
>                    |  NODE* | NODE* |
>                    +--------+-------+
>                      /            \
>     ________________/              \________________
>    |      NODE2     |              |      NODE3     |
>    |  uint64_t data |              |  uint64_t data |
>    +----------------|              +----------------|
>    |  NODE* | NODE* |              |  NODE* | NODE* |
>    +--------+-------+              +--------+-------+
>
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
>
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
>
> Such alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
>
> 3. Will coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
> Removed redundant GEM_WARN_ON()s. Improved to coding standard.
> Introduced a macro for checking whether hardware supports the feature.
>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
>
> Bspec: 11419
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.h | 1 +
> drivers/gpu/drm/i915/i915_gem_context.c | 41 +++++++++++++++++++++++++++
> drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
> drivers/gpu/drm/i915/intel_lrc.c | 49 +++++++++++++++++++++++++++++++++
> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
> include/uapi/drm/i915_drm.h | 6 ++++
> 6 files changed, 107 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 09ab124..7d4bbd5 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>
> #define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index b10770c..6db352e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -711,6 +711,26 @@ static bool client_is_banned(struct drm_i915_file_private *file_priv)
> return atomic_read(&file_priv->ban_score) >= I915_CLIENT_SCORE_BANNED;
> }
>
> +static int i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + int ret;
> +
> + ret = intel_lr_context_modify_data_port_coherency(ctx, true);
> + if (!ret)
> + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> + return ret;
> +}
> +
> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + int ret;
> +
> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> + if (!ret)
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> + return ret;
Is there a good reason you allow userspace to keep emitting an unlimited
number of commands which actually do not change the status? If not,
please consider gating the command emission with
test_and_set_bit/test_and_clear_bit. Hm.. although even with that they
could keep toggling ad infinitum with no real work in between. Has it
been considered to only save the desired state in set param and then
emit it, if needed, before next execbuf? Minor thing in any case, just
curious since I wasn't following the threads.
> +}
> +
> int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *dev_priv = to_i915(dev);
Feel free to use the local for the other existing to_i915(dev) call
sites in here.
Also use i915 for the local name. Unless I915_READ/WRITE is used, i915 is
the preferred name nowadays.
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_PRIORITY:
> args->value = ctx->sched.priority;
> break;
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> + ret = -ENODEV;
> + else
> + args->value = i915_gem_context_is_data_port_coherent(ctx);
> + break;
> default:
> ret = -EINVAL;
> break;
> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *dev_priv = to_i915(dev);
As with get_param.
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> }
> break;
>
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (args->size)
> + ret = -EINVAL;
> + else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
> + ret = -ENODEV;
> + else if (args->value == 1)
> + ret = i915_gem_context_set_data_port_coherent(ctx);
> + else if (args->value == 0)
> + ret = i915_gem_context_clear_data_port_coherent(ctx);
> + else
> + ret = -EINVAL;
> + break;
> +
> default:
> ret = -EINVAL;
> break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index b116e49..e8ccb70 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -126,6 +126,7 @@ struct i915_gem_context {
> #define CONTEXT_BANNABLE 3
> #define CONTEXT_BANNED 4
> #define CONTEXT_FORCE_SINGLE_SUBMISSION 5
> +#define CONTEXT_DATA_PORT_COHERENT 6
>
> /**
> * @hw_id: - unique identifier for the context
> @@ -257,6 +258,11 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
> __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
> }
>
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> +}
> +
> static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
> {
> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ab89dab..1f037e3 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
> ce->lrc_desc = desc;
> }
>
> +static int emit_set_data_port_coherency(struct i915_request *req, bool enable)
After much disagreement we ended up with rq as the consistent naming for
requests.
> +{
> + u32 *cs;
> + i915_reg_t reg;
> +
> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
> +
> + cs = intel_ring_begin(req, 4);
> + if (IS_ERR(cs))
> + return PTR_ERR(cs);
> +
> + if (INTEL_GEN(req->i915) >= 10)
> + reg = CNL_HDC_CHICKEN0;
> + else
> + reg = HDC_CHICKEN0;
> +
> + *cs++ = MI_LOAD_REGISTER_IMM(1);
> + *cs++ = i915_mmio_reg_offset(reg);
> + /* Enabling coherency means disabling the bit which forces it off */
> + if (enable)
> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> + else
> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> + *cs++ = MI_NOOP;
> +
> + intel_ring_advance(req, cs);
> +
> + return 0;
> +}
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> + bool enable)
> +{
> + struct i915_request *req;
rq as above.
> + int ret;
> +
> + req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
> + if (IS_ERR(req))
> + return PTR_ERR(req);
> +
> + ret = emit_set_data_port_coherency(req, enable);
> +
> + i915_request_add(req);
> +
> + return ret;
> +}
> +
> static struct i915_priolist *
> lookup_priolist(struct intel_engine_cs *engine, int prio)
> {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..f6965ae 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>
> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context *ctx,
> + bool enable);
> +
> #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..e677bea 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
> #define I915_CONTEXT_DEFAULT_PRIORITY 0
> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU will update memory
> + * buffers shared with CPU, by forcing internal cache units to send memory
> + * writes to real RAM faster. Keeping such coherency has performance cost.
Is this comment correct? Is it actually sending memory writes to _RAM_,
or is it just that the coherency mode is enabled, which adds a cost even
when only targeting the CPU or shared cache?
s/Keeping such coherency has performance cost./Enabling data port
coherency has a performance cost./ ? Or "can have a performance cost"?
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
> __u64 value;
> };
>
>
Since I understand this design has been approved already on the high
level, and as you can see I only had some minor comments to add, I can
say that the patch in principle looks okay to me.
Regards,
Tvrtko
* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-09 16:28 ` Tvrtko Ursulin
@ 2018-07-09 16:37 ` Chris Wilson
2018-07-10 17:32 ` Lis, Tomasz
2018-07-10 18:03 ` Lis, Tomasz
1 sibling, 1 reply; 88+ messages in thread
From: Chris Wilson @ 2018-07-09 16:37 UTC (permalink / raw)
To: Tomasz Lis, Tvrtko Ursulin, intel-gfx; +Cc: bartosz.dunajski
Quoting Tvrtko Ursulin (2018-07-09 17:28:02)
>
> On 09/07/2018 14:20, Tomasz Lis wrote:
> > +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> > +{
> > + int ret;
> > +
> > + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
> > + if (!ret)
> > + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
> > + return ret;
>
> Is there a good reason you allow userspace to keep emitting unlimited
> number of commands which actually do not change the status? If not
> please consider gating the command emission with
> test_and_set_bit/test_and_clear_bit. Hm.. apart even with that they
> could keep toggling ad infinitum with no real work in between. Has it
> been considered to only save the desired state in set param and then
> emit it, if needed, before next execbuf? Minor thing in any case, just
> curious since I wasn't following the threads.
The first patch tried to add a bit to execbuf, and having been
mistakenly down that road before, we asked if there was any alternative.
(Now if you've also been following execbuf3 conversations, having a
packet for privileged LRI is definitely something we want.)
Setting the value in the context register is precisely what we want to
do, and trivially serialised with execbuf since we have to serialise
reservation of ring space, i.e. the normal rules of request generation.
(execbuf is just a client and nothing special). From that point of view,
we only care about frequency: if it is very frequent it should be
controlled by userspace inside the batch (but it can't, due to there
being dangerous bits inside the reg aiui). At the other end of the
scale is context_setparam for set-once. And there should be no
in-between, as that requires costly batch flushes.
-Chris
* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-09 16:37 ` Chris Wilson
@ 2018-07-10 17:32 ` Lis, Tomasz
2018-07-11 9:28 ` Tvrtko Ursulin
0 siblings, 1 reply; 88+ messages in thread
From: Lis, Tomasz @ 2018-07-10 17:32 UTC (permalink / raw)
To: Chris Wilson, Tvrtko Ursulin, intel-gfx; +Cc: bartosz.dunajski
On 2018-07-09 18:37, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-07-09 17:28:02)
>> On 09/07/2018 14:20, Tomasz Lis wrote:
>>> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
>>> +{
>>> + int ret;
>>> +
>>> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>> + if (!ret)
>>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>> + return ret;
>> Is there a good reason you allow userspace to keep emitting unlimited
>> number of commands which actually do not change the status? If not
>> please consider gating the command emission with
>> test_and_set_bit/test_and_clear_bit. Hm.. apart even with that they
>> could keep toggling ad infinitum with no real work in between. Has it
>> been considered to only save the desired state in set param and then
>> emit it, if needed, before next execbuf? Minor thing in any case, just
>> curious since I wasn't following the threads.
> The first patch tried to add a bit to execbuf, and having been
> mistakenly down that road before, we asked if there was any alternative.
> (Now if you've also been following execbuf3 conversations, having a
> packet for privileged LRI is definitely something we want.)
>
> Setting the value in the context register is precisely what we want to
> do, and trivially serialised with execbuf since we have to serialise
> reservation of ring space, i.e. the normal rules of request generation.
> (execbuf is just a client and nothing special). From that point of view,
> we only care about frequency, if it is very frequent it should be
> controlled by userspace inside the batch (but it can't due to there
> being dangerous bits inside the reg aiui). At the other end of the
> scale, is context_setparam for set-once. And there should be no
> inbetween as that requires costly batch flushes.
> -Chris
Joonas did bring up that concern in his review; here it is, with my response:
On 2018-06-21 15:47, Lis, Tomasz wrote:
> On 2018-06-21 08:39, Joonas Lahtinen wrote:
>> I'm thinking we should set this value when it has changed, when we
>> insert the
>> requests into the command stream. So if you change back and forth, while
>> not emitting any requests, nothing really happens. If you change the
>> value and
>> emit a request, we should emit a LRI before the jump to the commands.
>> Similarly, if you keep setting the value to the value it already was in,
>> nothing will happen, again.
> When I considered that, my way of reasoning was:
> If we execute the flag changing buffer right away, it may be sent to
> hardware faster if there is no job in progress.
> If we use the lazy way, and trigger the change just before submission
> - there will be additional conditions in submission code, plus the
> change will be made when there is another job pending (though it's not
> a considerable payload to just switch a flag).
> If user space switches the flag back and forth without much sense,
> then there is something wrong with the user space driver, and it
> shouldn't be up to the kernel to fix that.
>
> This is why I chose the current approach. But I can change it if you
> wish.
So while I think the current solution is better from a performance
standpoint, I will change it if you request that.
-Tomasz
* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-10 17:32 ` Lis, Tomasz
@ 2018-07-11 9:28 ` Tvrtko Ursulin
0 siblings, 0 replies; 88+ messages in thread
From: Tvrtko Ursulin @ 2018-07-11 9:28 UTC (permalink / raw)
To: Lis, Tomasz, Chris Wilson, intel-gfx; +Cc: bartosz.dunajski
On 10/07/2018 18:32, Lis, Tomasz wrote:
> On 2018-07-09 18:37, Chris Wilson wrote:
>> Quoting Tvrtko Ursulin (2018-07-09 17:28:02)
>>> On 09/07/2018 14:20, Tomasz Lis wrote:
>>>> +static int i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
>>>> +{
>>>> + int ret;
>>>> +
>>>> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>>>> + if (!ret)
>>>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>>>> + return ret;
>>> Is there a good reason you allow userspace to keep emitting unlimited
>>> number of commands which actually do not change the status? If not
>>> please consider gating the command emission with
>>> test_and_set_bit/test_and_clear_bit. Hm.. apart even with that they
>>> could keep toggling ad infinitum with no real work in between. Has it
>>> been considered to only save the desired state in set param and then
>>> emit it, if needed, before next execbuf? Minor thing in any case, just
>>> curious since I wasn't following the threads.
>> The first patch tried to add a bit to execbuf, and having been
>> mistakenly down that road before, we asked if there was any alternative.
>> (Now if you've also been following execbuf3 conversations, having a
>> packet for privileged LRI is definitely something we want.)
>>
>> Setting the value in the context register is precisely what we want to
>> do, and trivially serialised with execbuf since we have to serialise
>> reservation of ring space, i.e. the normal rules of request generation.
>> (execbuf is just a client and nothing special). From that point of view,
>> we only care about frequency, if it is very frequent it should be
>> controlled by userspace inside the batch (but it can't due to there
>> being dangerous bits inside the reg aiui). At the other end of the
>> scale, is context_setparam for set-once. And there should be no
>> inbetween as that requires costly batch flushes.
>> -Chris
> Joonas did bring up that concern in his review; here it is, with my response:
>
> On 2018-06-21 15:47, Lis, Tomasz wrote:
>> On 2018-06-21 08:39, Joonas Lahtinen wrote:
>>> I'm thinking we should set this value when it has changed, when we
>>> insert the
>>> requests into the command stream. So if you change back and forth, while
>>> not emitting any requests, nothing really happens. If you change the
>>> value and
>>> emit a request, we should emit a LRI before the jump to the commands.
>>> Similarly, if you keep setting the value to the value it already was in,
>>> nothing will happen, again.
>> When I considered that, my way of reasoning was:
>> If we execute the flag changing buffer right away, it may be sent to
>> hardware faster if there is no job in progress.
>> If we use the lazy way, and trigger the change just before submission
>> - there will be additional conditions in submission code, plus the
>> change will be made when there is another job pending (though it's not
>> a considerable payload to just switch a flag).
>> If user space switches the flag back and forth without much sense,
>> then there is something wrong with the user space driver, and it
>> shouldn't be up to the kernel to fix that.
>>
>> This is why I chose the current approach. But I can change it if you
>> wish.
>
> So while I think the current solution is better from a performance
> standpoint, I will change it if you request that.
Sounds like an interesting dilemma and I can see both arguments.
But for me I still prefer the option where coherency programming is
emitted lazily on state change only. We do emit a bunch of pipe controls
to invalidate caches and such as preamble to every request so that fits
nicely. Advantage I see is that the set param ioctl remains very light
and doesn't do any command submission, keeping in spirit and expectation
with all current parameters. It makes the ioctl much quicker and as a
secondary benefit it protects userspace form their own sillyness.
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 88+ messages in thread
* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-09 16:28 ` Tvrtko Ursulin
2018-07-09 16:37 ` Chris Wilson
@ 2018-07-10 18:03 ` Lis, Tomasz
2018-07-11 11:20 ` Lis, Tomasz
1 sibling, 1 reply; 88+ messages in thread
From: Lis, Tomasz @ 2018-07-10 18:03 UTC (permalink / raw)
To: Tvrtko Ursulin, intel-gfx; +Cc: bartosz.dunajski
On 2018-07-09 18:28, Tvrtko Ursulin wrote:
>
> On 09/07/2018 14:20, Tomasz Lis wrote:
>> The patch adds a parameter to control the data port coherency
>> functionality
>> on a per-context level. When the IOCTL is called, a command to switch
>> data
>> port coherency state is added to the ordered list. All prior requests
>> are
>> executed on old coherency settings, and all exec requests after the
>> IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver developers requested functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is
>> disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic questions explaining the background
>> of the functionality and the reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that
>> is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature
>> that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN
>> architecture, adds
>> overhead related to tracking (snooping) memory inside different cache
>> units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence
>> require
>> memory coherency between CPU and GPU). The goal of coherency
>> enable/disable
>> switch is to remove overhead of memory coherency when memory
>> coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units),
>> Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send"
>> instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O
>> using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This
>> "stateless"
>> model is similar to regular memory load/store operations available on
>> typical
>> CPUs. Since this model provides I/O using arbitrary virtual
>> addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer
>> (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>> ________________
>> | NODE1 |
>> | uint64_t data |
>> +----------------|
>> | NODE* | NODE*|
>> +--------+-------+
>> / \
>> ________________/ \________________
>> | NODE2 | | NODE3 |
>> | uint64_t data | | uint64_t data |
>> +----------------| +----------------|
>> | NODE* | NODE*| | NODE* | NODE*|
>> +--------+-------+ +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory
>> locations
>> in different OCL allocations - e.g. NODE1 and NODE2 can reside in
>> one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM -
>> Shared
>> Virtual Memory feature). Using pointers from different allocations
>> doesn't
>> affect the stateless addressing model which even allows scattered
>> reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature
>> of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in
>> stateless model
>> can be encoded (at ISA level) to either use or disable coherency.
>> However, for
>> generic OCL applications (such as example with tree-like data
>> structure), OCL
>> compiler is not able to determine origin of memory pointed to by an
>> arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is
>> needed or not
>> for specific pointer (or for specific I/O instruction). As a result,
>> compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that
>> it would be
>> possible to workaround this (e.g. based on allocations map and
>> pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping
>> coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN
>> ISA level
>> is not feasible and alternative method is needed.
>>
>> Such alternative solution is to have a global coherency switch that
>> allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that
>> actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance
>> impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and
>> kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes
>> incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain)
>> with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and
>> processes
>> the payload that kern_master produced. These two kernels work in a
>> loop, one
>> after another. Since only kern_master requires coherency, kern_worker
>> should
>> not be forced to pay for it. This means that we need to have the
>> ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker ->
>> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> v2: Fixed compilation warning.
>> v3: Refactored the patch to add IOCTL instead of exec flag.
>> v4: Renamed and documented the API flag. Used strict values.
>> Removed redundant GEM_WARN_ON()s. Improved to coding standard.
>> Introduced a macro for checking whether hardware supports the
>> feature.
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>
>> Bspec: 11419
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_drv.h | 1 +
>> drivers/gpu/drm/i915/i915_gem_context.c | 41
>> +++++++++++++++++++++++++++
>> drivers/gpu/drm/i915/i915_gem_context.h | 6 ++++
>> drivers/gpu/drm/i915/intel_lrc.c | 49
>> +++++++++++++++++++++++++++++++++
>> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
>> include/uapi/drm/i915_drm.h | 6 ++++
>> 6 files changed, 107 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index 09ab124..7d4bbd5 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private
>> *dev_priv)
>> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap &
>> EDRAM_ENABLED))
>> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
>> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>> #define HWS_NEEDS_PHYSICAL(dev_priv)
>> ((dev_priv)->info.hws_needs_physical)
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index b10770c..6db352e 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -711,6 +711,26 @@ static bool client_is_banned(struct
>> drm_i915_file_private *file_priv)
>> return atomic_read(&file_priv->ban_score) >=
>> I915_CLIENT_SCORE_BANNED;
>> }
>> +static int i915_gem_context_set_data_port_coherent(struct
>> i915_gem_context *ctx)
>> +{
>> + int ret;
>> +
>> + ret = intel_lr_context_modify_data_port_coherency(ctx, true);
>> + if (!ret)
>> + __set_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> + return ret;
>> +}
>> +
>> +static int i915_gem_context_clear_data_port_coherent(struct
>> i915_gem_context *ctx)
>> +{
>> + int ret;
>> +
>> + ret = intel_lr_context_modify_data_port_coherency(ctx, false);
>> + if (!ret)
>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> + return ret;
>
> Is there a good reason you allow userspace to keep emitting an unlimited
> number of commands which actually do not change the status? If not,
> please consider gating the command emission with
> test_and_set_bit/test_and_clear_bit. Hm.. although even with that they
> could keep toggling ad infinitum with no real work in between. Has it
> been considered to only save the desired state in set param and then
> emit it, if needed, before the next execbuf? Minor thing in any case, just
> curious since I wasn't following the threads.
>
(discussed further in a separate thread)
>> +}
>> +
>> int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>> struct drm_file *file)
>> {
>> @@ -784,6 +804,7 @@ int i915_gem_context_destroy_ioctl(struct
>> drm_device *dev, void *data,
>> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void
>> *data,
>> struct drm_file *file)
>> {
>> + struct drm_i915_private *dev_priv = to_i915(dev);
>
> Feel free to use the local for the other existing to_i915(dev) call
> sites in here.
>
> Also use i915 for the local name. Unless I915_READ/WRITE is used i915
> is preferred nowadays.
Will do.
>
>> struct drm_i915_file_private *file_priv = file->driver_priv;
>> struct drm_i915_gem_context_param *args = data;
>> struct i915_gem_context *ctx;
>> @@ -818,6 +839,12 @@ int i915_gem_context_getparam_ioctl(struct
>> drm_device *dev, void *data,
>> case I915_CONTEXT_PARAM_PRIORITY:
>> args->value = ctx->sched.priority;
>> break;
>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> + if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> + ret = -ENODEV;
>> + else
>> + args->value = i915_gem_context_is_data_port_coherent(ctx);
>> + break;
>> default:
>> ret = -EINVAL;
>> break;
>> @@ -830,6 +857,7 @@ int i915_gem_context_getparam_ioctl(struct
>> drm_device *dev, void *data,
>> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void
>> *data,
>> struct drm_file *file)
>> {
>> + struct drm_i915_private *dev_priv = to_i915(dev);
>
> As with get_param.
Ack.
>
>> struct drm_i915_file_private *file_priv = file->driver_priv;
>> struct drm_i915_gem_context_param *args = data;
>> struct i915_gem_context *ctx;
>> @@ -893,6 +921,19 @@ int i915_gem_context_setparam_ioctl(struct
>> drm_device *dev, void *data,
>> }
>> break;
>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> + if (args->size)
>> + ret = -EINVAL;
>> + else if (!HAS_DATA_PORT_COHERENCY(dev_priv))
>> + ret = -ENODEV;
>> + else if (args->value == 1)
>> + ret = i915_gem_context_set_data_port_coherent(ctx);
>> + else if (args->value == 0)
>> + ret = i915_gem_context_clear_data_port_coherent(ctx);
>> + else
>> + ret = -EINVAL;
>> + break;
>> +
>> default:
>> ret = -EINVAL;
>> break;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h
>> b/drivers/gpu/drm/i915/i915_gem_context.h
>> index b116e49..e8ccb70 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>> @@ -126,6 +126,7 @@ struct i915_gem_context {
>> #define CONTEXT_BANNABLE 3
>> #define CONTEXT_BANNED 4
>> #define CONTEXT_FORCE_SINGLE_SUBMISSION 5
>> +#define CONTEXT_DATA_PORT_COHERENT 6
>> /**
>> * @hw_id: - unique identifier for the context
>> @@ -257,6 +258,11 @@ static inline void
>> i915_gem_context_set_force_single_submission(struct i915_gem_
>> __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>> }
>> +static inline bool i915_gem_context_is_data_port_coherent(struct
>> i915_gem_context *ctx)
>> +{
>> + return test_bit(CONTEXT_DATA_PORT_COHERENT, &ctx->flags);
>> +}
>> +
>> static inline bool i915_gem_context_is_default(const struct
>> i915_gem_context *c)
>> {
>> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c
>> b/drivers/gpu/drm/i915/intel_lrc.c
>> index ab89dab..1f037e3 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -259,6 +259,55 @@ intel_lr_context_descriptor_update(struct
>> i915_gem_context *ctx,
>> ce->lrc_desc = desc;
>> }
>> +static int emit_set_data_port_coherency(struct i915_request *req,
>> bool enable)
>
> After much disagreement we ended up with rq as the consistent naming
> for requests.
:)
ok.
>
>> +{
>> + u32 *cs;
>> + i915_reg_t reg;
>> +
>> + GEM_BUG_ON(req->engine->class != RENDER_CLASS);
>> + GEM_BUG_ON(INTEL_GEN(req->i915) < 9);
>> +
>> + cs = intel_ring_begin(req, 4);
>> + if (IS_ERR(cs))
>> + return PTR_ERR(cs);
>> +
>> + if (INTEL_GEN(req->i915) >= 10)
>> + reg = CNL_HDC_CHICKEN0;
>> + else
>> + reg = HDC_CHICKEN0;
>> +
>> + *cs++ = MI_LOAD_REGISTER_IMM(1);
>> + *cs++ = i915_mmio_reg_offset(reg);
>> + /* Enabling coherency means disabling the bit which forces it
>> off */
>> + if (enable)
>> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> + else
>> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> + *cs++ = MI_NOOP;
>> +
>> + intel_ring_advance(req, cs);
>> +
>> + return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context
>> *ctx,
>> + bool enable)
>> +{
>> + struct i915_request *req;
>
> rq as above.
ack
>
>> + int ret;
>> +
>> + req = i915_request_alloc(ctx->i915->engine[RCS], ctx);
>> + if (IS_ERR(req))
>> + return PTR_ERR(req);
>> +
>> + ret = emit_set_data_port_coherency(req, enable);
>> +
>> + i915_request_add(req);
>> +
>> + return ret;
>> +}
>> +
>> static struct i915_priolist *
>> lookup_priolist(struct intel_engine_cs *engine, int prio)
>> {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h
>> b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..f6965ae 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_gem_context
>> *ctx,
>> + bool enable);
>> +
>> #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..e677bea 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1456,6 +1456,12 @@ struct drm_i915_gem_context_param {
>> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
>> #define I915_CONTEXT_DEFAULT_PRIORITY 0
>> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
>> +/*
>> + * When data port level coherency is enabled, the GPU will update
>> memory
>> + * buffers shared with CPU, by forcing internal cache units to send
>> memory
>> + * writes to real RAM faster. Keeping such coherency has performance
>> cost.
>
> Is this comment correct? Is it actually sending memory writes to
> _RAM_, or just the coherency mode enabled, even if only targetting CPU
> or shared cache, which adds a cost?
I'm not sure whether there are further coherency modes that choose how
"deep" the coherency goes. The OCL team's use case is to see gradual
changes in the buffers on the CPU side while the execution progresses. A write
to RAM is needed to achieve that, and that limits performance by using
RAM bandwidth.
>
> s/Keeping such coherency has performance cost./Enabling data port
> coherency has a performance cost./ ? Or "can have a performance cost"?
I would prefer "Enabling data port coherency has a performance cost.".
There likely are workloads where the performance impact is unmeasurable, but in
the real world, using more memory writes will always slow something down.
>
>> + */
>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
>> __u64 value;
>> };
>>
>
> Since I understand this design has been approved already on the high
> level, and as you can see I only had some minor comments to add, I can
> say that the patch in principle looks okay to me.
Great; will produce a v5 soon.
-Tomasz
>
> Regards,
>
> Tvrtko
* Re: [PATCH v4] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-10 18:03 ` Lis, Tomasz
@ 2018-07-11 11:20 ` Lis, Tomasz
0 siblings, 0 replies; 88+ messages in thread
From: Lis, Tomasz @ 2018-07-11 11:20 UTC (permalink / raw)
To: Tvrtko Ursulin, intel-gfx; +Cc: bartosz.dunajski
On 2018-07-10 20:03, Lis, Tomasz wrote:
>
>
> On 2018-07-09 18:28, Tvrtko Ursulin wrote:
>>
>> On 09/07/2018 14:20, Tomasz Lis wrote:
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h
>>> b/drivers/gpu/drm/i915/intel_lrc.h
>>> index 1593194..f6965ae 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>>> [...]
>>> +/*
>>> + * When data port level coherency is enabled, the GPU will update
>>> memory
>>> + * buffers shared with CPU, by forcing internal cache units to send
>>> memory
>>> + * writes to real RAM faster. Keeping such coherency has
>>> performance cost.
>>
>> Is this comment correct? Is it actually sending memory writes to
>> _RAM_, or just the coherency mode enabled, even if only targetting
>> CPU or shared cache, which adds a cost?
> I'm not sure whether there are further coherency modes that choose how
> "deep" the coherency goes. The OCL team's use case is to see gradual
> changes in the buffers on the CPU side while the execution progresses.
> A write to RAM is needed to achieve that, and that limits performance by
> using RAM bandwidth.
It was pointed out to me that the last level cache is shared between CPU and
GPU on non-Atom parts, which means my argument was invalid, and most likely
the coherency option does not enforce a RAM write. I will update the comment.
-Tomasz
* [PATCH v5] drm/i915: Add IOCTL Param to control data port coherency.
2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
` (4 preceding siblings ...)
2018-07-09 13:20 ` [PATCH v4] " Tomasz Lis
@ 2018-07-12 15:10 ` Tomasz Lis
2018-07-13 10:40 ` Tvrtko Ursulin
2018-10-09 18:06 ` [PATCH v6] " Tomasz Lis
2018-10-12 15:02 ` [PATCH v8] " Tomasz Lis
7 siblings, 1 reply; 88+ messages in thread
From: Tomasz Lis @ 2018-07-12 15:10 UTC (permalink / raw)
To: intel-gfx; +Cc: bartosz.dunajski
The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch data
port coherency state is added to the ordered list. All prior requests are
executed on old coherency settings, and all exec requests after the IOCTL
will use new settings.
Rationale:
The OpenCL driver developers requested functionality to control cache
coherency at the data port level. Keeping coherency at that level is disabled
by default due to its performance costs. The OpenCL driver plans to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:
1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?
Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of coherency enable/disable
switch is to remove overhead of memory coherency when memory coherency is not
needed.
2. Why do we need a global coherency switch?
In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
________________
| NODE1 |
| uint64_t data |
+----------------|
| NODE* | NODE*|
+--------+-------+
/ \
________________/ \________________
| NODE2 | | NODE3 |
| uint64_t data | | uint64_t data |
+----------------| +----------------|
| NODE* | NODE*| | NODE* | NODE*|
+--------+-------+ +--------+-------+
Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as example with tree-like data structure), OCL
compiler is not able to determine origin of memory pointed to by an arbitrary
pointer - i.e. is not able to track given pointer back to a specific
allocation. As such, it's not able to decide whether coherency is needed or not
for specific pointer (or for specific I/O instruction). As a result, compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to workaround this (e.g. based on allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and an alternative method is needed.
The alternative solution is a global coherency switch that allows
disabling coherency for single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)
3. Will the coherency switch be used frequently?
There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle coherency switch on or off per each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
v2: Fixed compilation warning.
v3: Refactored the patch to add IOCTL instead of exec flag.
v4: Renamed and documented the API flag. Used strict values.
Removed redundant GEM_WARN_ON()s. Improved to coding standard.
Introduced a macro for checking whether hardware supports the feature.
v5: Renamed some locals. Made the flag write to be lazy.
Updated comments to remove misconceptions. Added gen11 support.
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>
Bspec: 11419
Bspec: 19175
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
drivers/gpu/drm/i915/i915_drv.h | 1 +
drivers/gpu/drm/i915/i915_gem_context.c | 29 +++++++++++++---
drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
drivers/gpu/drm/i915/i915_gem_execbuffer.c | 6 ++++
drivers/gpu/drm/i915/intel_lrc.c | 55 ++++++++++++++++++++++++++++++
drivers/gpu/drm/i915/intel_lrc.h | 4 +++
include/uapi/drm/i915_drm.h | 7 ++++
7 files changed, 115 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 01dd298..73192e1 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
#define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
#define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
+#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
#define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index b10770c..b5b63ac 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -784,6 +784,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
+ struct drm_i915_private *i915 = to_i915(dev);
struct drm_i915_file_private *file_priv = file->driver_priv;
struct drm_i915_gem_context_param *args = data;
struct i915_gem_context *ctx;
@@ -804,10 +805,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
case I915_CONTEXT_PARAM_GTT_SIZE:
if (ctx->ppgtt)
args->value = ctx->ppgtt->vm.total;
- else if (to_i915(dev)->mm.aliasing_ppgtt)
- args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
+ else if (i915->mm.aliasing_ppgtt)
+ args->value = i915->mm.aliasing_ppgtt->vm.total;
else
- args->value = to_i915(dev)->ggtt.vm.total;
+ args->value = i915->ggtt.vm.total;
break;
case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
args->value = i915_gem_context_no_error_capture(ctx);
@@ -818,6 +819,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
case I915_CONTEXT_PARAM_PRIORITY:
args->value = ctx->sched.priority;
break;
+ case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+ if (!HAS_DATA_PORT_COHERENCY(i915))
+ ret = -ENODEV;
+ else
+ args->value = i915_gem_context_is_data_port_coherent_requested(ctx);
+ break;
default:
ret = -EINVAL;
break;
@@ -830,6 +837,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
+ struct drm_i915_private *i915 = to_i915(dev);
struct drm_i915_file_private *file_priv = file->driver_priv;
struct drm_i915_gem_context_param *args = data;
struct i915_gem_context *ctx;
@@ -880,7 +888,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
if (args->size)
ret = -EINVAL;
- else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
+ else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
ret = -ENODEV;
else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
priority < I915_CONTEXT_MIN_USER_PRIORITY)
@@ -893,6 +901,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
}
break;
+ case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+ if (args->size)
+ ret = -EINVAL;
+ else if (!HAS_DATA_PORT_COHERENCY(i915))
+ ret = -ENODEV;
+ else if (args->value == 1)
+ i915_gem_context_set_data_port_coherent_requested(ctx);
+ else if (args->value == 0)
+ i915_gem_context_clear_data_port_coherent_requested(ctx);
+ else
+ ret = -EINVAL;
+ break;
+
default:
ret = -EINVAL;
break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index b116e49..826af84 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -126,6 +126,8 @@ struct i915_gem_context {
#define CONTEXT_BANNABLE 3
#define CONTEXT_BANNED 4
#define CONTEXT_FORCE_SINGLE_SUBMISSION 5
+#define CONTEXT_DATA_PORT_COHERENT_REQUESTED 6
+#define CONTEXT_DATA_PORT_COHERENT_ACTIVE 7
/**
* @hw_id: - unique identifier for the context
@@ -257,6 +259,21 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
__set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
}
+static inline bool i915_gem_context_is_data_port_coherent_requested(struct i915_gem_context *ctx)
+{
+ return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_set_data_port_coherent_requested(struct i915_gem_context *ctx)
+{
+ __set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_clear_data_port_coherent_requested(struct i915_gem_context *ctx)
+{
+ __clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
{
return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 3f0c612..64a7cd4 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -2361,6 +2361,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
goto err_batch_unpin;
}
+ /* Emit the switch of data port coherency state if needed */
+ err = intel_lr_context_modify_data_port_coherency(eb.request,
+ i915_gem_context_is_data_port_coherent_requested(eb.ctx));
+ if (GEM_WARN_ON(err))
+ DRM_DEBUG("Data Port Coherency toggle failed, keeping old setting.\n");
+
if (in_fence) {
err = i915_request_await_dma_fence(eb.request, in_fence);
if (err < 0)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 35d37af..fcee03d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -259,6 +259,61 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
ce->lrc_desc = desc;
}
+static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
+{
+ u32 *cs;
+ i915_reg_t reg;
+
+ GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
+ GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
+
+ cs = intel_ring_begin(rq, 4);
+ if (IS_ERR(cs))
+ return PTR_ERR(cs);
+
+ if (INTEL_GEN(rq->i915) >= 11)
+ reg = ICL_HDC_MODE;
+ else if (INTEL_GEN(rq->i915) >= 10)
+ reg = CNL_HDC_CHICKEN0;
+ else
+ reg = HDC_CHICKEN0;
+
+ *cs++ = MI_LOAD_REGISTER_IMM(1);
+ *cs++ = i915_mmio_reg_offset(reg);
+ /* Enabling coherency means disabling the bit which forces it off */
+ if (enable)
+ *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+ else
+ *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+ *cs++ = MI_NOOP;
+
+ intel_ring_advance(rq, cs);
+
+ return 0;
+}
+
+int
+intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
+ bool enable)
+{
+ struct i915_gem_context *ctx = rq->gem_context;
+ int ret;
+
+ if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
+ return 0;
+
+ ret = emit_set_data_port_coherency(rq, enable);
+
+ if (!ret) {
+ if (enable)
+ __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+ else
+ __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+ }
+
+ return ret;
+}
+
static struct i915_priolist *
lookup_priolist(struct intel_engine_cs *engine, int prio)
{
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 1593194..20e8664 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -104,4 +104,8 @@ struct i915_gem_context;
void intel_lr_context_resume(struct drm_i915_private *dev_priv);
+int
+intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
+ bool enable);
+
#endif /* _INTEL_LRC_H_ */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7f5634c..0a4e31f 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
#define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
#define I915_CONTEXT_DEFAULT_PRIORITY 0
#define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
+/*
+ * When data port level coherency is enabled, the GPU will update memory
+ * buffers shared with CPU, by forcing internal cache units to send memory
+ * writes to higher level caches faster. Enabling data port coherency has
+ * performance cost.
+ */
+#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
__u64 value;
};
--
2.7.4
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [PATCH v5] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-12 15:10 ` [PATCH v5] " Tomasz Lis
@ 2018-07-13 10:40 ` Tvrtko Ursulin
2018-07-13 17:44 ` Lis, Tomasz
0 siblings, 1 reply; 88+ messages in thread
From: Tvrtko Ursulin @ 2018-07-13 10:40 UTC (permalink / raw)
To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski
On 12/07/2018 16:10, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
>
> Rationale:
>
> The OpenCL driver developers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic questions explaining the background
> of the functionality and reasoning for the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
> ________________
> | NODE1 |
> | uint64_t data |
> +----------------|
> | NODE* | NODE*|
> +--------+-------+
> / \
> ________________/ \________________
> | NODE2 | | NODE3 |
> | uint64_t data | | uint64_t data |
> +----------------| +----------------|
> | NODE* | NODE*| | NODE* | NODE*|
> +--------+-------+ +--------+-------+
>
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
>
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
>
> Such an alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
>
> 3. Will coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
> Removed redundant GEM_WARN_ON()s. Improved adherence to coding standard.
> Introduced a macro for checking whether hardware supports the feature.
> v5: Renamed some locals. Made the flag write to be lazy.
> Updated comments to remove misconceptions. Added gen11 support.
>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
>
> Bspec: 11419
> Bspec: 19175
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.h | 1 +
> drivers/gpu/drm/i915/i915_gem_context.c | 29 +++++++++++++---
> drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
> drivers/gpu/drm/i915/i915_gem_execbuffer.c | 6 ++++
> drivers/gpu/drm/i915/intel_lrc.c | 55 ++++++++++++++++++++++++++++++
> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
> include/uapi/drm/i915_drm.h | 7 ++++
> 7 files changed, 115 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 01dd298..73192e1 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private *dev_priv)
> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>
> #define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index b10770c..b5b63ac 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -784,6 +784,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *i915 = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -804,10 +805,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_GTT_SIZE:
> if (ctx->ppgtt)
> args->value = ctx->ppgtt->vm.total;
> - else if (to_i915(dev)->mm.aliasing_ppgtt)
> - args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
> + else if (i915->mm.aliasing_ppgtt)
> + args->value = i915->mm.aliasing_ppgtt->vm.total;
> else
> - args->value = to_i915(dev)->ggtt.vm.total;
> + args->value = i915->ggtt.vm.total;
> break;
> case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
> args->value = i915_gem_context_no_error_capture(ctx);
> @@ -818,6 +819,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_PRIORITY:
> args->value = ctx->sched.priority;
> break;
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (!HAS_DATA_PORT_COHERENCY(i915))
> + ret = -ENODEV;
> + else
> + args->value = i915_gem_context_is_data_port_coherent_requested(ctx);
Feels a bit like an overly long name, so maybe drop the _requested suffix;
a suggestion only.
> + break;
> default:
> ret = -EINVAL;
> break;
> @@ -830,6 +837,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *i915 = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -880,7 +888,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>
> if (args->size)
> ret = -EINVAL;
> - else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> + else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> ret = -ENODEV;
> else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
> priority < I915_CONTEXT_MIN_USER_PRIORITY)
> @@ -893,6 +901,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> }
> break;
>
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (args->size)
> + ret = -EINVAL;
> + else if (!HAS_DATA_PORT_COHERENCY(i915))
> + ret = -ENODEV;
> + else if (args->value == 1)
> + i915_gem_context_set_data_port_coherent_requested(ctx);
> + else if (args->value == 0)
> + i915_gem_context_clear_data_port_coherent_requested(ctx);
> + else
> + ret = -EINVAL;
> + break;
> +
> default:
> ret = -EINVAL;
> break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index b116e49..826af84 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -126,6 +126,8 @@ struct i915_gem_context {
> #define CONTEXT_BANNABLE 3
> #define CONTEXT_BANNED 4
> #define CONTEXT_FORCE_SINGLE_SUBMISSION 5
> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED 6
> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE 7
>
> /**
> * @hw_id: - unique identifier for the context
> @@ -257,6 +259,21 @@ static inline void i915_gem_context_set_force_single_submission(struct i915_gem_
> __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
> }
>
> +static inline bool i915_gem_context_is_data_port_coherent_requested(struct i915_gem_context *ctx)
> +{
> + return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_set_data_port_coherent_requested(struct i915_gem_context *ctx)
> +{
> + __set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_clear_data_port_coherent_requested(struct i915_gem_context *ctx)
> +{
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
> {
> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 3f0c612..64a7cd4 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -2361,6 +2361,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
> goto err_batch_unpin;
> }
>
> + /* Emit the switch of data port coherency state if needed */
> + err = intel_lr_context_modify_data_port_coherency(eb.request,
> + i915_gem_context_is_data_port_coherent_requested(eb.ctx));
> + if (GEM_WARN_ON(err))
> + DRM_DEBUG("Data Port Coherency toggle failed, keeping old setting.\n");
I think we should propagate the error to userspace here. By virtue
of MIN_SPACE_FOR_ADD_REQUEST* we guarantee there must be space for
request emission.
GEM_WARN_ON is therefore okay to let us know we got the value of
MIN_SPACE_FOR_ADD_REQUEST wrong. Just remove the "keeping old setting"
from the debug message.
* Having looked at the commit which last increased
MIN_SPACE_FOR_ADD_REQUEST I suspect the current value is large enough
for this addition and that we could probably look at decreasing it. It
is a manual process though so not straightforward.
But also since this is >= GEN9 code I think it needs to be done deeper.
Like in the backend layer sounds right to me.
Maybe intel_lrc.c/gen8_emit_flush_render in the EMIT_INVALIDATE mode?
That is the request preamble dealing with invalidating caches so
modifying cache coherency mode there as well sounds like a fit to me.
> +
> if (in_fence) {
> err = i915_request_await_dma_fence(eb.request, in_fence);
> if (err < 0)
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 35d37af..fcee03d 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,61 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
> ce->lrc_desc = desc;
> }
>
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> + u32 *cs;
> + i915_reg_t reg;
> +
> + GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> + GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> + cs = intel_ring_begin(rq, 4);
> + if (IS_ERR(cs))
> + return PTR_ERR(cs);
> +
> + if (INTEL_GEN(rq->i915) >= 11)
> + reg = ICL_HDC_MODE;
> + else if (INTEL_GEN(rq->i915) >= 10)
> + reg = CNL_HDC_CHICKEN0;
> + else
> + reg = HDC_CHICKEN0;
> +
> + *cs++ = MI_LOAD_REGISTER_IMM(1);
> + *cs++ = i915_mmio_reg_offset(reg);
> + /* Enabling coherency means disabling the bit which forces it off */
> + if (enable)
> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> + else
> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> + *cs++ = MI_NOOP;
> +
> + intel_ring_advance(rq, cs);
> +
> + return 0;
> +}
> +
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
> + bool enable)
> +{
> + struct i915_gem_context *ctx = rq->gem_context;
> + int ret;
> +
I'd put a lockdep_assert_held on struct_mutex here to mark it up for the
future.
> + if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
You don't need to pass in enable to this function since it can figure
out what to do from the flags on its own:
if ((ctx->flags & REQUESTED) == (ctx->flags & ACTIVE))
return 0;
After which the function should probably be renamed to
intel_lr_context_update_data_port_coherency?
> + return 0;
> +
> + ret = emit_set_data_port_coherency(rq, enable);
And then:
..(rq, ctx->flags & REQUESTED)
> +
> + if (!ret) {
> + if (enable)
> + __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> + else
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> + }
> +
> + return ret;
> +}
> +
> static struct i915_priolist *
> lookup_priolist(struct intel_engine_cs *engine, int prio)
> {
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 1593194..20e8664 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,8 @@ struct i915_gem_context;
>
> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>
> +int
> +intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
> + bool enable);
> +
> #endif /* _INTEL_LRC_H_ */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 7f5634c..0a4e31f 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
> #define I915_CONTEXT_DEFAULT_PRIORITY 0
> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU will update memory
> + * buffers shared with CPU, by forcing internal cache units to send memory
> + * writes to higher level caches faster. Enabling data port coherency has
> + * performance cost.
"has _a_ performance cost" I think but not a native speaker so might be
wrong.
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
> __u64 value;
> };
>
>
Regards,
Tvrtko
* Re: [PATCH v5] drm/i915: Add IOCTL Param to control data port coherency.
2018-07-13 10:40 ` Tvrtko Ursulin
@ 2018-07-13 17:44 ` Lis, Tomasz
0 siblings, 0 replies; 88+ messages in thread
From: Lis, Tomasz @ 2018-07-13 17:44 UTC (permalink / raw)
To: Tvrtko Ursulin, intel-gfx; +Cc: bartosz.dunajski
On 2018-07-13 12:40, Tvrtko Ursulin wrote:
>
> On 12/07/2018 16:10, Tomasz Lis wrote:
>> The patch adds a parameter to control the data port coherency
>> functionality
>> on a per-context level. When the IOCTL is called, a command to switch
>> data
>> port coherency state is added to the ordered list. All prior requests
>> are
>> executed on old coherency settings, and all exec requests after the
>> IOCTL
>> will use new settings.
>>
>> Rationale:
>>
>> The OpenCL driver developers requested a functionality to control cache
>> coherency at data port level. Keeping the coherency at that level is
>> disabled
>> by default due to its performance costs. OpenCL driver is planning to
>> enable it for a small subset of submissions, when such functionality is
>> required. Below are answers to basic questions explaining the background
>> of the functionality and reasoning for the proposed implementation:
>>
>> 1. Why do we need a coherency enable/disable switch for memory that
>> is shared
>> between CPU and GEN (GPU)?
>>
>> Memory coherency between CPU and GEN, while being a great feature
>> that enables
>> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN
>> architecture, adds
>> overhead related to tracking (snooping) memory inside different cache
>> units
>> (L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
>> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence
>> require
>> memory coherency between CPU and GPU). The goal of coherency
>> enable/disable
>> switch is to remove overhead of memory coherency when memory
>> coherency is not
>> needed.
>>
>> 2. Why do we need a global coherency switch?
>>
>> In order to support I/O commands from within EUs (Execution Units),
>> Intel GEN
>> ISA (GEN Instruction Set Assembly) contains dedicated "send"
>> instructions.
>> These send instructions provide several addressing models. One of these
>> addressing models (named "stateless") provides most flexible I/O
>> using plain
>> virtual addresses (as opposed to buffer_handle+offset models). This
>> "stateless"
>> model is similar to regular memory load/store operations available on
>> typical
>> CPUs. Since this model provides I/O using arbitrary virtual
>> addresses, it
>> enables algorithmic designs that are based on pointer-to-pointer
>> (e.g. buffer
>> of pointers) concepts. For instance, it allows creating tree-like data
>> structures such as:
>> ________________
>> | NODE1 |
>> | uint64_t data |
>> +----------------|
>> | NODE* | NODE*|
>> +--------+-------+
>> / \
>> ________________/ \________________
>> | NODE2 | | NODE3 |
>> | uint64_t data | | uint64_t data |
>> +----------------| +----------------|
>> | NODE* | NODE*| | NODE* | NODE*|
>> +--------+-------+ +--------+-------+
>>
>> Please note that pointers inside such structures can point to memory
>> locations
>> in different OCL allocations - e.g. NODE1 and NODE2 can reside in
>> one OCL
>> allocation while NODE3 resides in a completely separate OCL allocation.
>> Additionally, such pointers can be shared with CPU (i.e. using SVM -
>> Shared
>> Virtual Memory feature). Using pointers from different allocations
>> doesn't
>> affect the stateless addressing model which even allows scattered
>> reading from
>> different allocations at the same time (i.e. by utilizing SIMD-nature
>> of send
>> instructions).
>>
>> When it comes to coherency programming, send instructions in
>> stateless model
>> can be encoded (at ISA level) to either use or disable coherency.
>> However, for
>> generic OCL applications (such as example with tree-like data
>> structure), OCL
>> compiler is not able to determine origin of memory pointed to by an
>> arbitrary
>> pointer - i.e. is not able to track given pointer back to a specific
>> allocation. As such, it's not able to decide whether coherency is
>> needed or not
>> for specific pointer (or for specific I/O instruction). As a result,
>> compiler
>> encodes all stateless sends as coherent (doing otherwise would lead to
>> functional issues resulting from data corruption). Please note that
>> it would be
>> possible to workaround this (e.g. based on allocations map and
>> pointer bounds
>> checking prior to each I/O instruction) but the performance cost of such
>> workaround would be many times greater than the cost of keeping
>> coherency
>> always enabled. As such, enabling/disabling memory coherency at GEN
>> ISA level
>> is not feasible and alternative method is needed.
>>
>> Such an alternative solution is to have a global coherency switch that
>> allows
>> disabling coherency for single (though entire) GPU submission. This is
>> beneficial because this way we:
>> * can enable (and pay for) coherency only in submissions that
>> actually need
>> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
>> * don't care about coherency at GEN ISA granularity (no performance
>> impact)
>>
>> 3. Will coherency switch be used frequently?
>>
>> There are scenarios that will require frequent toggling of the coherency
>> switch.
>> E.g. an application has two OCL compute kernels: kern_master and
>> kern_worker.
>> kern_master uses, concurrently with CPU, some fine grain SVM resources
>> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
>> computational work that needs to be executed. kern_master analyzes
>> incoming
>> work descriptors and populates a plain OCL buffer (non-fine-grain)
>> with payload
>> for kern_worker. Once kern_master is done, kern_worker kicks-in and
>> processes
>> the payload that kern_master produced. These two kernels work in a
>> loop, one
>> after another. Since only kern_master requires coherency, kern_worker
>> should
>> not be forced to pay for it. This means that we need to have the
>> ability to
>> toggle coherency switch on or off per each GPU submission:
>> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker ->
>> (ENABLE
>> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>>
>> v2: Fixed compilation warning.
>> v3: Refactored the patch to add IOCTL instead of exec flag.
>> v4: Renamed and documented the API flag. Used strict values.
>> Removed redundant GEM_WARN_ON()s. Improved adherence to coding standard.
>> Introduced a macro for checking whether hardware supports the
>> feature.
>> v5: Renamed some locals. Made the flag write to be lazy.
>> Updated comments to remove misconceptions. Added gen11 support.
>>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Michal Winiarski <michal.winiarski@intel.com>
>>
>> Bspec: 11419
>> Bspec: 19175
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_drv.h | 1 +
>> drivers/gpu/drm/i915/i915_gem_context.c | 29 +++++++++++++---
>> drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
>> drivers/gpu/drm/i915/i915_gem_execbuffer.c | 6 ++++
>> drivers/gpu/drm/i915/intel_lrc.c | 55
>> ++++++++++++++++++++++++++++++
>> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
>> include/uapi/drm/i915_drm.h | 7 ++++
>> 7 files changed, 115 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index 01dd298..73192e1 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2524,6 +2524,7 @@ intel_info(const struct drm_i915_private
>> *dev_priv)
>> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap &
>> EDRAM_ENABLED))
>> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
>> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
>> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>> #define HWS_NEEDS_PHYSICAL(dev_priv)
>> ((dev_priv)->info.hws_needs_physical)
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index b10770c..b5b63ac 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -784,6 +784,7 @@ int i915_gem_context_destroy_ioctl(struct
>> drm_device *dev, void *data,
>> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void
>> *data,
>> struct drm_file *file)
>> {
>> + struct drm_i915_private *i915 = to_i915(dev);
>> struct drm_i915_file_private *file_priv = file->driver_priv;
>> struct drm_i915_gem_context_param *args = data;
>> struct i915_gem_context *ctx;
>> @@ -804,10 +805,10 @@ int i915_gem_context_getparam_ioctl(struct
>> drm_device *dev, void *data,
>> case I915_CONTEXT_PARAM_GTT_SIZE:
>> if (ctx->ppgtt)
>> args->value = ctx->ppgtt->vm.total;
>> - else if (to_i915(dev)->mm.aliasing_ppgtt)
>> - args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
>> + else if (i915->mm.aliasing_ppgtt)
>> + args->value = i915->mm.aliasing_ppgtt->vm.total;
>> else
>> - args->value = to_i915(dev)->ggtt.vm.total;
>> + args->value = i915->ggtt.vm.total;
>> break;
>> case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
>> args->value = i915_gem_context_no_error_capture(ctx);
>> @@ -818,6 +819,12 @@ int i915_gem_context_getparam_ioctl(struct
>> drm_device *dev, void *data,
>> case I915_CONTEXT_PARAM_PRIORITY:
>> args->value = ctx->sched.priority;
>> break;
>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> + if (!HAS_DATA_PORT_COHERENCY(i915))
>> + ret = -ENODEV;
>> + else
>> + args->value =
>> i915_gem_context_is_data_port_coherent_requested(ctx);
>
> Feels a bit like an overly long name, so maybe drop the _requested suffix;
> a suggestion only.
I was considering this as well; will do.
>
>> + break;
>> default:
>> ret = -EINVAL;
>> break;
>> @@ -830,6 +837,7 @@ int i915_gem_context_getparam_ioctl(struct
>> drm_device *dev, void *data,
>> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void
>> *data,
>> struct drm_file *file)
>> {
>> + struct drm_i915_private *i915 = to_i915(dev);
>> struct drm_i915_file_private *file_priv = file->driver_priv;
>> struct drm_i915_gem_context_param *args = data;
>> struct i915_gem_context *ctx;
>> @@ -880,7 +888,7 @@ int i915_gem_context_setparam_ioctl(struct
>> drm_device *dev, void *data,
>> if (args->size)
>> ret = -EINVAL;
>> - else if (!(to_i915(dev)->caps.scheduler &
>> I915_SCHEDULER_CAP_PRIORITY))
>> + else if (!(i915->caps.scheduler &
>> I915_SCHEDULER_CAP_PRIORITY))
>> ret = -ENODEV;
>> else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
>> priority < I915_CONTEXT_MIN_USER_PRIORITY)
>> @@ -893,6 +901,19 @@ int i915_gem_context_setparam_ioctl(struct
>> drm_device *dev, void *data,
>> }
>> break;
>> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
>> + if (args->size)
>> + ret = -EINVAL;
>> + else if (!HAS_DATA_PORT_COHERENCY(i915))
>> + ret = -ENODEV;
>> + else if (args->value == 1)
>> + i915_gem_context_set_data_port_coherent_requested(ctx);
>> + else if (args->value == 0)
>> + i915_gem_context_clear_data_port_coherent_requested(ctx);
>> + else
>> + ret = -EINVAL;
>> + break;
>> +
>> default:
>> ret = -EINVAL;
>> break;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h
>> b/drivers/gpu/drm/i915/i915_gem_context.h
>> index b116e49..826af84 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>> @@ -126,6 +126,8 @@ struct i915_gem_context {
>> #define CONTEXT_BANNABLE 3
>> #define CONTEXT_BANNED 4
>> #define CONTEXT_FORCE_SINGLE_SUBMISSION 5
>> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED 6
>> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE 7
>> /**
>> * @hw_id: - unique identifier for the context
>> @@ -257,6 +259,21 @@ static inline void
>> i915_gem_context_set_force_single_submission(struct i915_gem_
>> __set_bit(CONTEXT_FORCE_SINGLE_SUBMISSION, &ctx->flags);
>> }
>> +static inline bool i915_gem_context_is_data_port_coherent_requested(struct i915_gem_context *ctx)
>> +{
>> + return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
>> +}
>> +
>> +static inline void i915_gem_context_set_data_port_coherent_requested(struct i915_gem_context *ctx)
>> +{
>> + __set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
>> +}
>> +
>> +static inline void i915_gem_context_clear_data_port_coherent_requested(struct i915_gem_context *ctx)
>> +{
>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
>> +}
>> +
>> static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
>> {
>> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> index 3f0c612..64a7cd4 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> @@ -2361,6 +2361,12 @@ i915_gem_do_execbuffer(struct drm_device *dev,
>> goto err_batch_unpin;
>> }
>> + /* Emit the switch of data port coherency state if needed */
>> + err = intel_lr_context_modify_data_port_coherency(eb.request,
>> + i915_gem_context_is_data_port_coherent_requested(eb.ctx));
>> + if (GEM_WARN_ON(err))
>> + DRM_DEBUG("Data Port Coherency toggle failed, keeping old setting.\n");
>
> I think we should propagate the error to userspace here. By the virtue
> of MIN_SPACE_FOR_ADD_REQUEST* we guarantee there must be space for
> request emission.
>
> GEM_WARN_ON is therefore okay to let us know we got the value of
> MIN_SPACE_FOR_ADD_REQUEST wrong. Just remove the "keeping old setting"
> from the debug message.
ack
>
> * Having looked at the commit which last increased
> MIN_SPACE_FOR_ADD_REQUEST I suspect the current value is large enough
> for this addition and that we could probably look at decreasing it. It
> is a manual process though so not straightforward.
>
> But also since this is >= GEN9 code I think it needs to be done
> deeper. Like in the backend layer sounds right to me.
>
> Maybe intel_lrc.c/gen8_emit_flush_render in the EMIT_INVALIDATE mode?
> That is the request preamble dealing with invalidating caches so
> modifying cache coherency mode there as well sounds like a fit to me.
>
Agreed. Will move.
>> +
>> if (in_fence) {
>> err = i915_request_await_dma_fence(eb.request, in_fence);
>> if (err < 0)
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 35d37af..fcee03d 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -259,6 +259,61 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
>> ce->lrc_desc = desc;
>> }
>> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
>> +{
>> + u32 *cs;
>> + i915_reg_t reg;
>> +
>> + GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
>> + GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
>> +
>> + cs = intel_ring_begin(rq, 4);
>> + if (IS_ERR(cs))
>> + return PTR_ERR(cs);
>> +
>> + if (INTEL_GEN(rq->i915) >= 11)
>> + reg = ICL_HDC_MODE;
>> + else if (INTEL_GEN(rq->i915) >= 10)
>> + reg = CNL_HDC_CHICKEN0;
>> + else
>> + reg = HDC_CHICKEN0;
>> +
>> + *cs++ = MI_LOAD_REGISTER_IMM(1);
>> + *cs++ = i915_mmio_reg_offset(reg);
>> + /* Enabling coherency means disabling the bit which forces it off */
>> + if (enable)
>> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
>> + else
>> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
>> + *cs++ = MI_NOOP;
>> +
>> + intel_ring_advance(rq, cs);
>> +
>> + return 0;
>> +}
>> +
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
>> + bool enable)
>> +{
>> + struct i915_gem_context *ctx = rq->gem_context;
>> + int ret;
>> +
>
> I'd put a lockdep_assert_held on struct_mutex here to mark it up for
> the future.
ok, will do.
>
>> + if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
>
> You don't need to pass in enable to this function since it can figure
> out what to do from the flags on its own:
>
> if ((ctx->flags & REQUESTED) == (ctx->flags & ACTIVE))
> return 0;
>
> After which functions should probably be renamed to
> intel_lr_context_update_data_port_coherency?
>
ack
>> + return 0;
>> +
>> + ret = emit_set_data_port_coherency(rq, enable);
>
> And then:
>
> ..(rq, ctx->flags & REQUESTED)
ok, I will use a local though.
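For reference, the lazy update being discussed can be modeled as a standalone sketch (not driver code; the helper name is invented, and only the bit positions are taken from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Standalone model of the lazy toggle -- NOT driver code.  REQUESTED
 * mirrors what userspace last asked for, ACTIVE mirrors the state last
 * emitted to the hardware; an LRI is only needed when the two disagree.
 * Bit positions match the CONTEXT_DATA_PORT_COHERENT_* values from the
 * patch; update_coherency() itself is illustrative.
 */
#define REQUESTED (1UL << 6)	/* CONTEXT_DATA_PORT_COHERENT_REQUESTED */
#define ACTIVE    (1UL << 7)	/* CONTEXT_DATA_PORT_COHERENT_ACTIVE */

/* Returns true if a register write would be emitted; it then updates
 * ACTIVE to match REQUESTED, as the real code does after a successful
 * emit_set_data_port_coherency(). */
static bool update_coherency(unsigned long *flags)
{
	bool requested = *flags & REQUESTED;
	bool active = *flags & ACTIVE;

	if (requested == active)
		return false;	/* state already matches, skip the LRI */

	if (requested)
		*flags |= ACTIVE;
	else
		*flags &= ~ACTIVE;

	return true;
}
```

Two consecutive submissions with the same setting emit the LRI only once, which is the point of keeping a separate ACTIVE bit.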
>
>> +
>> + if (!ret) {
>> + if (enable)
>> + __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
>> + else
>> + __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
>> + }
>> +
>> + return ret;
>> +}
>> +
>> static struct i915_priolist *
>> lookup_priolist(struct intel_engine_cs *engine, int prio)
>> {
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
>> index 1593194..20e8664 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.h
>> +++ b/drivers/gpu/drm/i915/intel_lrc.h
>> @@ -104,4 +104,8 @@ struct i915_gem_context;
>> void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>> +int
>> +intel_lr_context_modify_data_port_coherency(struct i915_request *rq,
>> + bool enable);
>> +
>> #endif /* _INTEL_LRC_H_ */
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 7f5634c..0a4e31f 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1456,6 +1456,13 @@ struct drm_i915_gem_context_param {
>> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
>> #define I915_CONTEXT_DEFAULT_PRIORITY 0
>> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
>> +/*
>> + * When data port level coherency is enabled, the GPU will update memory
>> + * buffers shared with CPU, by forcing internal cache units to send memory
>> + * writes to higher level caches faster. Enabling data port coherency has
>> + * performance cost.
>
> "has _a_ performance cost" I think but not a native speaker so might
> be wrong.
Agreed.
Will send the update as soon as it's tested.
>
>> + */
>> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
>> __u64 value;
>> };
>>
>
> Regards,
>
> Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 88+ messages in thread
* [PATCH v6] drm/i915: Add IOCTL Param to control data port coherency.
2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
` (5 preceding siblings ...)
2018-07-12 15:10 ` [PATCH v5] " Tomasz Lis
@ 2018-10-09 18:06 ` Tomasz Lis
2018-10-10 7:29 ` Tvrtko Ursulin
2018-10-12 15:02 ` [PATCH v8] " Tomasz Lis
7 siblings, 1 reply; 88+ messages in thread
From: Tomasz Lis @ 2018-10-09 18:06 UTC (permalink / raw)
To: intel-gfx; +Cc: bartosz.dunajski
The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch data
port coherency state is added to the ordered list. All prior requests are
executed on old coherency settings, and all exec requests after the IOCTL
will use new settings.
Rationale:
The OpenCL driver develpers requested a functionality to control cache
coherency at data port level. Keeping the coherency at that level is disabled
by default due to its performance costs. OpenCL driver is planning to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and reasoning for the proposed implementation:
1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?
Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of coherency enable/disable
switch is to remove overhead of memory coherency when memory coherency is not
needed.
2. Why do we need a global coherency switch?
In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides the most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
                     ________________
                    |      NODE1     |
                    |  uint64_t data |
                    +----------------|
                    |  NODE*  | NODE*|
                    +--------+-------+
                      /              \
     ________________/                \________________
    |      NODE2     |                |      NODE3     |
    |  uint64_t data |                |  uint64_t data |
    +----------------|                +----------------|
    |  NODE*  | NODE*|                |  NODE*  | NODE*|
    +--------+-------+                +--------+-------+
Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as example with tree-like data structure), OCL
compiler is not able to determine origin of memory pointed to by an arbitrary
pointer - i.e. is not able to track given pointer back to a specific
allocation. As such, it's not able to decide whether coherency is needed or not
for specific pointer (or for specific I/O instruction). As a result, compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to workaround this (e.g. based on allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and an alternative method is needed.
Such an alternative solution is to have a global coherency switch that allows
disabling coherency for single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)
3. Will coherency switch be used frequently?
There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle coherency switch on or off per each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
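From the userspace side, toggling the switch between such submissions could look roughly like this hypothetical sketch (the struct mirrors the layout of drm_i915_gem_context_param from i915_drm.h, the helper name is invented, and the ioctl call is stubbed out since it needs a real DRM fd):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace sketch of driving the per-context switch.
 * i915_context_param_sketch mirrors drm_i915_gem_context_param;
 * prepare_coherency_param() is an invented helper, not a library call.
 */
#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7

struct i915_context_param_sketch {
	uint32_t ctx_id;
	uint32_t size;
	uint64_t param;
	uint64_t value;
};

static void prepare_coherency_param(struct i915_context_param_sketch *p,
				    uint32_t ctx_id, int enable)
{
	memset(p, 0, sizeof(*p));
	p->ctx_id = ctx_id;
	p->size = 0;			/* non-zero size is rejected with -EINVAL */
	p->param = I915_CONTEXT_PARAM_DATA_PORT_COHERENCY;
	p->value = enable ? 1 : 0;	/* only 0 and 1 are accepted */
	/* real code: drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, p); */
}
```

With that in place, the loop above becomes: set value=1, submit kern_master, set value=0, submit kern_worker, repeat.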
v2: Fixed compilation warning.
v3: Refactored the patch to add IOCTL instead of exec flag.
v4: Renamed and documented the API flag. Used strict values.
Removed redundant GEM_WARN_ON()s. Improved to coding standard.
Introduced a macro for checking whether hardware supports the feature.
v5: Renamed some locals. Made the flag write to be lazy.
Updated comments to remove misconceptions. Added gen11 support.
v6: Moved the flag write to gen8_emit_flush_render(). Renamed some functions.
Moved all flags checking to one place. Added mutex check.
v7: Removed 2 comments, improved API comment. (Joonas)
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>
Bspec: 11419
Bspec: 19175
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
drivers/gpu/drm/i915/i915_drv.h | 1 +
drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
drivers/gpu/drm/i915/intel_lrc.c | 64 ++++++++++++++++++++++++++++++++-
include/uapi/drm/i915_drm.h | 10 ++++++
5 files changed, 116 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 794a8a0..e1ea5cb 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
#define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
#define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
+#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
#define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 8cbe580..718ede9 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
+ struct drm_i915_private *i915 = to_i915(dev);
struct drm_i915_file_private *file_priv = file->driver_priv;
struct drm_i915_gem_context_param *args = data;
struct i915_gem_context *ctx;
@@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
case I915_CONTEXT_PARAM_GTT_SIZE:
if (ctx->ppgtt)
args->value = ctx->ppgtt->vm.total;
- else if (to_i915(dev)->mm.aliasing_ppgtt)
- args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
+ else if (i915->mm.aliasing_ppgtt)
+ args->value = i915->mm.aliasing_ppgtt->vm.total;
else
- args->value = to_i915(dev)->ggtt.vm.total;
+ args->value = i915->ggtt.vm.total;
break;
case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
args->value = i915_gem_context_no_error_capture(ctx);
@@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
case I915_CONTEXT_PARAM_PRIORITY:
args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
break;
+ case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+ if (!HAS_DATA_PORT_COHERENCY(i915))
+ ret = -ENODEV;
+ else
+ args->value = i915_gem_context_is_data_port_coherent(ctx);
+ break;
default:
ret = -EINVAL;
break;
@@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
+ struct drm_i915_private *i915 = to_i915(dev);
struct drm_i915_file_private *file_priv = file->driver_priv;
struct drm_i915_gem_context_param *args = data;
struct i915_gem_context *ctx;
@@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
if (args->size)
ret = -EINVAL;
- else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
+ else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
ret = -ENODEV;
else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
priority < I915_CONTEXT_MIN_USER_PRIORITY)
@@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
}
break;
+ case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+ if (args->size)
+ ret = -EINVAL;
+ else if (!HAS_DATA_PORT_COHERENCY(i915))
+ ret = -ENODEV;
+ else if (args->value == 1)
+ i915_gem_context_set_data_port_coherent(ctx);
+ else if (args->value == 0)
+ i915_gem_context_clear_data_port_coherent(ctx);
+ else
+ ret = -EINVAL;
+ break;
+
default:
ret = -EINVAL;
break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index f6d870b..55969bc 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -131,6 +131,8 @@ struct i915_gem_context {
#define CONTEXT_BANNED 0
#define CONTEXT_CLOSED 1
#define CONTEXT_FORCE_SINGLE_SUBMISSION 2
+#define CONTEXT_DATA_PORT_COHERENT_REQUESTED 6
+#define CONTEXT_DATA_PORT_COHERENT_ACTIVE 7
/**
* @hw_id: - unique identifier for the context
@@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
atomic_dec(&ctx->hw_id_pin_count);
}
+static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
+{
+ return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
+{
+ __set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
+{
+ __clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
{
return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ff0e2b3..313fb72 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
ce->lrc_desc = desc;
}
+static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
+{
+ u32 *cs;
+ i915_reg_t reg;
+
+ GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
+ GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
+
+ cs = intel_ring_begin(rq, 4);
+ if (IS_ERR(cs))
+ return PTR_ERR(cs);
+
+ if (INTEL_GEN(rq->i915) >= 11)
+ reg = ICL_HDC_MODE;
+ else if (INTEL_GEN(rq->i915) >= 10)
+ reg = CNL_HDC_CHICKEN0;
+ else
+ reg = HDC_CHICKEN0;
+
+ *cs++ = MI_LOAD_REGISTER_IMM(1);
+ *cs++ = i915_mmio_reg_offset(reg);
+ if (enable)
+ *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+ else
+ *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+ *cs++ = MI_NOOP;
+
+ intel_ring_advance(rq, cs);
+
+ return 0;
+}
+
+static int
+intel_lr_context_update_data_port_coherency(struct i915_request *rq)
+{
+ struct i915_gem_context *ctx = rq->gem_context;
+ bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+ int ret;
+
+ lockdep_assert_held(&rq->i915->drm.struct_mutex);
+
+ if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
+ return 0;
+
+ ret = emit_set_data_port_coherency(rq, enable);
+
+ if (!ret) {
+ if (enable)
+ __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+ else
+ __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+ }
+
+ return ret;
+}
+
static void unwind_wa_tail(struct i915_request *rq)
{
rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
@@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
bool vf_flush_wa = false, dc_flush_wa = false;
u32 *cs, flags = 0;
- int len;
+ int err, len;
flags |= PIPE_CONTROL_CS_STALL;
@@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
/* WaForGAMHang:kbl */
if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
dc_flush_wa = true;
+
+ err = intel_lr_context_update_data_port_coherency(request);
+ if (GEM_WARN_ON(err)) {
+ DRM_DEBUG("Data Port Coherency toggle failed.\n");
+ return err;
+ }
}
len = 6;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 298b2e1..8f8211b 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
#define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
#define I915_CONTEXT_DEFAULT_PRIORITY 0
#define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
+/*
+ * When data port level coherency is enabled, the GPU and CPU will both keep
+ * changes to memory content visible to each other as fast as possible, by
+ * forcing internal cache units to send memory writes to higher level caches
+ * immediately after writes. Only buffers with coherency requested within
+ * surface state, or specific stateless accesses will be affected by this
+ * option. Keeping data port coherency has a performance cost, and therefore
+ * it is by default disabled (see WaForceEnableNonCoherent).
+ */
+#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
__u64 value;
};
--
2.7.4
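As background for the _MASKED_BIT_ENABLE()/_MASKED_BIT_DISABLE() writes in the patch: masked registers carry a write-enable mask in the upper 16 bits of the dword, so a single MI_LOAD_REGISTER_IMM can flip one chicken bit without a read-modify-write. A standalone sketch of the encoding (the HDC_FORCE_NON_COHERENT bit position below is illustrative; the real value lives in i915_reg.h):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of masked-register semantics -- not driver code.  The macros
 * follow the documented encoding: mask in bits 31:16, value in 15:0.
 */
#define MASKED_BIT_ENABLE(a)  (((a) << 16) | (a))	/* set the bit */
#define MASKED_BIT_DISABLE(a) ((a) << 16)		/* clear the bit */

#define HDC_FORCE_NON_COHERENT (1u << 4)	/* illustrative position */

/* Model of how hardware applies a masked write: only bits whose mask
 * (upper half) is set are updated from the value (lower half). */
static uint32_t apply_masked_write(uint32_t reg, uint32_t val)
{
	uint32_t mask = val >> 16;

	return (reg & ~mask) | (val & mask);
}
```

Note the inversion in the patch: enabling coherency means disabling HDC_FORCE_NON_COHERENT, and vice versa.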
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply related [flat|nested] 88+ messages in thread
* Re: [PATCH v6] drm/i915: Add IOCTL Param to control data port coherency.
2018-10-09 18:06 ` [PATCH v6] " Tomasz Lis
@ 2018-10-10 7:29 ` Tvrtko Ursulin
0 siblings, 0 replies; 88+ messages in thread
From: Tvrtko Ursulin @ 2018-10-10 7:29 UTC (permalink / raw)
To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski
On 09/10/2018 19:06, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
>
> Rationale:
>
> The OpenCL driver developers requested a functionality to control cache
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. OpenCL driver is planning to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic questions explaining the background
> of the functionality and reasoning for the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides the most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
>                      ________________
>                     |      NODE1     |
>                     |  uint64_t data |
>                     +----------------|
>                     |  NODE*  | NODE*|
>                     +--------+-------+
>                       /              \
>      ________________/                \________________
>     |      NODE2     |                |      NODE3     |
>     |  uint64_t data |                |  uint64_t data |
>     +----------------|                +----------------|
>     |  NODE*  | NODE*|                |  NODE*  | NODE*|
>     +--------+-------+                +--------+-------+
>
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
>
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as example with tree-like data structure), OCL
> compiler is not able to determine origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and an alternative method is needed.
>
> Such an alternative solution is to have a global coherency switch that allows
> disabling coherency for single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
>
> 3. Will coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks-in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
> Removed redundant GEM_WARN_ON()s. Improved to coding standard.
> Introduced a macro for checking whether hardware supports the feature.
> v5: Renamed some locals. Made the flag write to be lazy.
> Updated comments to remove misconceptions. Added gen11 support.
> v6: Moved the flag write to gen8_emit_flush_render(). Renamed some functions.
> Moved all flags checking to one place. Added mutex check.
> v7: Removed 2 comments, improved API comment. (Joonas)
>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
>
> Bspec: 11419
> Bspec: 19175
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.h | 1 +
> drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
> drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
> drivers/gpu/drm/i915/intel_lrc.c | 64 ++++++++++++++++++++++++++++++++-
> include/uapi/drm/i915_drm.h | 10 ++++++
> 5 files changed, 116 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 794a8a0..e1ea5cb 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>
> #define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 8cbe580..718ede9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *i915 = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_GTT_SIZE:
> if (ctx->ppgtt)
> args->value = ctx->ppgtt->vm.total;
> - else if (to_i915(dev)->mm.aliasing_ppgtt)
> - args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
> + else if (i915->mm.aliasing_ppgtt)
> + args->value = i915->mm.aliasing_ppgtt->vm.total;
> else
> - args->value = to_i915(dev)->ggtt.vm.total;
> + args->value = i915->ggtt.vm.total;
> break;
> case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
> args->value = i915_gem_context_no_error_capture(ctx);
> @@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_PRIORITY:
> args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
> break;
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (!HAS_DATA_PORT_COHERENCY(i915))
> + ret = -ENODEV;
> + else
> + args->value = i915_gem_context_is_data_port_coherent(ctx);
> + break;
> default:
> ret = -EINVAL;
> break;
> @@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *i915 = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>
> if (args->size)
> ret = -EINVAL;
> - else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> + else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> ret = -ENODEV;
> else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
> priority < I915_CONTEXT_MIN_USER_PRIORITY)
> @@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> }
> break;
>
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (args->size)
> + ret = -EINVAL;
> + else if (!HAS_DATA_PORT_COHERENCY(i915))
> + ret = -ENODEV;
> + else if (args->value == 1)
> + i915_gem_context_set_data_port_coherent(ctx);
> + else if (args->value == 0)
> + i915_gem_context_clear_data_port_coherent(ctx);
> + else
> + ret = -EINVAL;
> + break;
> +
> default:
> ret = -EINVAL;
> break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index f6d870b..55969bc 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -131,6 +131,8 @@ struct i915_gem_context {
> #define CONTEXT_BANNED 0
> #define CONTEXT_CLOSED 1
> #define CONTEXT_FORCE_SINGLE_SUBMISSION 2
> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED 6
> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE 7
>
> /**
> * @hw_id: - unique identifier for the context
> @@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
> atomic_dec(&ctx->hw_id_pin_count);
> }
>
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + __set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
> {
> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ff0e2b3..313fb72 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
> ce->lrc_desc = desc;
> }
>
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> + u32 *cs;
> + i915_reg_t reg;
> +
> + GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> + GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> + cs = intel_ring_begin(rq, 4);
> + if (IS_ERR(cs))
> + return PTR_ERR(cs);
> +
> + if (INTEL_GEN(rq->i915) >= 11)
> + reg = ICL_HDC_MODE;
> + else if (INTEL_GEN(rq->i915) >= 10)
> + reg = CNL_HDC_CHICKEN0;
> + else
> + reg = HDC_CHICKEN0;
> +
> + *cs++ = MI_LOAD_REGISTER_IMM(1);
> + *cs++ = i915_mmio_reg_offset(reg);
> + if (enable)
> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> + else
> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> + *cs++ = MI_NOOP;
> +
> + intel_ring_advance(rq, cs);
> +
> + return 0;
> +}
> +
> +static int
> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
> +{
> + struct i915_gem_context *ctx = rq->gem_context;
> + bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> + int ret;
> +
> + lockdep_assert_held(&rq->i915->drm.struct_mutex);
> +
> + if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
> + return 0;
> +
> + ret = emit_set_data_port_coherency(rq, enable);
> +
> + if (!ret) {
> + if (enable)
> + __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> + else
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> + }
> +
> + return ret;
> +}
> +
> static void unwind_wa_tail(struct i915_request *rq)
> {
> rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
> @@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
> i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
> bool vf_flush_wa = false, dc_flush_wa = false;
> u32 *cs, flags = 0;
> - int len;
> + int err, len;
>
> flags |= PIPE_CONTROL_CS_STALL;
>
> @@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
> /* WaForGAMHang:kbl */
> if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
> dc_flush_wa = true;
> +
> + err = intel_lr_context_update_data_port_coherency(request);
> + if (GEM_WARN_ON(err)) {
Awooga awooga! ((tm) by Chris) :))
Please someone review and ack my patch which makes GEM_WARN_ON safe.
Regards,
Tvrtko
> + DRM_DEBUG("Data Port Coherency toggle failed.\n");
> + return err;
> + }
> }
>
> len = 6;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 298b2e1..8f8211b 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
> #define I915_CONTEXT_DEFAULT_PRIORITY 0
> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU and CPU will both keep
> + * changes to memory content visible to each other as fast as possible, by
> + * forcing internal cache units to send memory writes to higher level caches
> immediately after writes. Only buffers with coherency requested within
> + * surface state, or specific stateless accesses will be affected by this
> + * option. Keeping data port coherency has a performance cost, and therefore
> + * it is by default disabled (see WaForceEnableNonCoherent).
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
> __u64 value;
> };
>
>
^ permalink raw reply [flat|nested] 88+ messages in thread
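[Editorial note: the LRI payloads in the quoted emit_set_data_port_coherency() use masked-bit writes — the upper 16 bits of the written value act as a per-bit write enable, so a single MI_LOAD_REGISTER_IMM can flip one chicken bit without a read-modify-write. A minimal sketch of that encoding; the helpers and the bit position are illustrative stand-ins, not the kernel's actual macros:]

```c
#include <stdint.h>

/* Illustrative equivalents of i915's masked-bit helpers: the high
 * half carries the write-enable mask, the low half the new values. */
static uint32_t masked_bit_enable(uint32_t bit)
{
	return (bit << 16) | bit;	/* mask set, bit set */
}

static uint32_t masked_bit_disable(uint32_t bit)
{
	return bit << 16;		/* mask set, bit cleared */
}

#define HDC_FORCE_NON_COHERENT_BIT (1u << 4)	/* assumed position */
```

Enabling coherency therefore writes the mask bit with the value bit cleared (clearing the force-non-coherent bit), while all unmasked bits keep their previous state.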
* [PATCH v8] drm/i915: Add IOCTL Param to control data port coherency.
2018-03-19 12:37 ` [RFC v1] drm/i915: Add Exec param to control data port coherency Tomasz Lis
` (6 preceding siblings ...)
2018-10-09 18:06 ` [PATCH v6] " Tomasz Lis
@ 2018-10-12 15:02 ` Tomasz Lis
2018-10-15 12:52 ` Tvrtko Ursulin
2018-10-16 13:59 ` Joonas Lahtinen
7 siblings, 2 replies; 88+ messages in thread
From: Tomasz Lis @ 2018-10-12 15:02 UTC (permalink / raw)
To: intel-gfx; +Cc: bartosz.dunajski
The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch data
port coherency state is added to the ordered list. All prior requests are
executed on old coherency settings, and all exec requests after the IOCTL
will use new settings.
Rationale:
The OpenCL driver develpers requested a functionality to control cache
coherency at data port level. Keeping the coherency at that level is disabled
by default due to its performance costs. The OpenCL driver plans to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:
1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?
Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of coherency enable/disable
switch is to remove overhead of memory coherency when memory coherency is not
needed.
2. Why do we need a global coherency switch?
In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides most flexible I/O using plain
virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
________________
| NODE1 |
| uint64_t data |
+----------------|
| NODE* | NODE*|
+--------+-------+
/ \
________________/ \________________
| NODE2 | | NODE3 |
| uint64_t data | | uint64_t data |
+----------------| +----------------|
| NODE* | NODE*| | NODE* | NODE*|
+--------+-------+ +--------+-------+
Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
Virtual Memory feature). Using pointers from different allocations doesn't
affect the stateless addressing model which even allows scattered reading from
different allocations at the same time (i.e. by utilizing SIMD-nature of send
instructions).
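[Editorial note: the diagram above corresponds to a plain C-style node layout — hypothetical, for illustration only — where each node mixes data with pointers that may land in different allocations, which is exactly why the compiler cannot classify a pointer's coherency needs statically:]

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node matching the ASCII diagram: data plus two child
 * pointers that may point into entirely different OCL allocations. */
struct node {
	uint64_t data;
	struct node *left;
	struct node *right;
};

/* Pointer-chasing walk of the kind that stateless sends enable. */
static uint64_t sum_tree(const struct node *n)
{
	if (!n)
		return 0;
	return n->data + sum_tree(n->left) + sum_tree(n->right);
}
```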
When it comes to coherency programming, send instructions in stateless model
can be encoded (at ISA level) to either use or disable coherency. However, for
generic OCL applications (such as the tree-like data structure example), the OCL
compiler is not able to determine the origin of memory pointed to by an arbitrary
pointer - i.e. is not able to track given pointer back to a specific
allocation. As such, it's not able to decide whether coherency is needed or not
for specific pointer (or for specific I/O instruction). As a result, compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to workaround this (e.g. based on allocations map and pointer bounds
checking prior to each I/O instruction) but the performance cost of such
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at GEN ISA level
is not feasible and alternative method is needed.
One such alternative solution is a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)
3. Will coherency switch be used frequently?
There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine grain SVM resources
(CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
computational work that needs to be executed. kern_master analyzes incoming
work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
for kern_worker. Once kern_master is done, kern_worker kicks in and processes
the payload that kern_master produced. These two kernels work in a loop, one
after another. Since only kern_master requires coherency, kern_worker should
not be forced to pay for it. This means that we need to have the ability to
toggle coherency switch on or off per each GPU submission:
(ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
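[Editorial note: from userspace, the toggle pattern above maps to one context setparam call between submissions. A sketch of the request setup; the struct layout is assumed to mirror drm_i915_gem_context_param from i915_drm.h, and the actual drmIoctl call is left commented out since it needs a real DRM fd:]

```c
#include <stdint.h>

#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7

/* Assumed mirror of struct drm_i915_gem_context_param. */
struct i915_context_param {
	uint32_t ctx_id;
	uint32_t size;
	uint64_t param;
	uint64_t value;
};

/* Fill the setparam request: size must be 0 for this param (else the
 * kernel returns -EINVAL), and value is strictly 0 or 1. */
static void prep_coherency_toggle(uint32_t ctx_id, int enable,
				  struct i915_context_param *req)
{
	req->ctx_id = ctx_id;
	req->size = 0;
	req->param = I915_CONTEXT_PARAM_DATA_PORT_COHERENCY;
	req->value = enable ? 1 : 0;
	/* real code: drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, req); */
}
```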
v2: Fixed compilation warning.
v3: Refactored the patch to add IOCTL instead of exec flag.
v4: Renamed and documented the API flag. Used strict values.
Removed redundant GEM_WARN_ON()s. Improved to coding standard.
Introduced a macro for checking whether hardware supports the feature.
v5: Renamed some locals. Made the flag write to be lazy.
Updated comments to remove misconceptions. Added gen11 support.
v6: Moved the flag write to gen8_emit_flush_render(). Renamed some functions.
Moved all flags checking to one place. Added mutex check.
v7: Removed 2 comments, improved API comment. (Joonas)
v8: Use non-GEM WARN_ON when in statements. (Tvrtko)
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Winiarski <michal.winiarski@intel.com>
Bspec: 11419
Bspec: 19175
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
---
drivers/gpu/drm/i915/i915_drv.h | 1 +
drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
drivers/gpu/drm/i915/intel_lrc.c | 64 ++++++++++++++++++++++++++++++++-
include/uapi/drm/i915_drm.h | 10 ++++++
5 files changed, 116 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3017ef0..90b3a0ff 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
#define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
#define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
+#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
#define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 8cbe580..718ede9 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
+ struct drm_i915_private *i915 = to_i915(dev);
struct drm_i915_file_private *file_priv = file->driver_priv;
struct drm_i915_gem_context_param *args = data;
struct i915_gem_context *ctx;
@@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
case I915_CONTEXT_PARAM_GTT_SIZE:
if (ctx->ppgtt)
args->value = ctx->ppgtt->vm.total;
- else if (to_i915(dev)->mm.aliasing_ppgtt)
- args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
+ else if (i915->mm.aliasing_ppgtt)
+ args->value = i915->mm.aliasing_ppgtt->vm.total;
else
- args->value = to_i915(dev)->ggtt.vm.total;
+ args->value = i915->ggtt.vm.total;
break;
case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
args->value = i915_gem_context_no_error_capture(ctx);
@@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
case I915_CONTEXT_PARAM_PRIORITY:
args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
break;
+ case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+ if (!HAS_DATA_PORT_COHERENCY(i915))
+ ret = -ENODEV;
+ else
+ args->value = i915_gem_context_is_data_port_coherent(ctx);
+ break;
default:
ret = -EINVAL;
break;
@@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *file)
{
+ struct drm_i915_private *i915 = to_i915(dev);
struct drm_i915_file_private *file_priv = file->driver_priv;
struct drm_i915_gem_context_param *args = data;
struct i915_gem_context *ctx;
@@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
if (args->size)
ret = -EINVAL;
- else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
+ else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
ret = -ENODEV;
else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
priority < I915_CONTEXT_MIN_USER_PRIORITY)
@@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
}
break;
+ case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
+ if (args->size)
+ ret = -EINVAL;
+ else if (!HAS_DATA_PORT_COHERENCY(i915))
+ ret = -ENODEV;
+ else if (args->value == 1)
+ i915_gem_context_set_data_port_coherent(ctx);
+ else if (args->value == 0)
+ i915_gem_context_clear_data_port_coherent(ctx);
+ else
+ ret = -EINVAL;
+ break;
+
default:
ret = -EINVAL;
break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index f6d870b..69f9247 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -131,6 +131,8 @@ struct i915_gem_context {
#define CONTEXT_BANNED 0
#define CONTEXT_CLOSED 1
#define CONTEXT_FORCE_SINGLE_SUBMISSION 2
+#define CONTEXT_DATA_PORT_COHERENT_REQUESTED 3
+#define CONTEXT_DATA_PORT_COHERENT_ACTIVE 4
/**
* @hw_id: - unique identifier for the context
@@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
atomic_dec(&ctx->hw_id_pin_count);
}
+static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
+{
+ return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
+{
+ __set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
+static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
+{
+ __clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+}
+
static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
{
return c->user_handle == DEFAULT_CONTEXT_HANDLE;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ff0e2b3..8680bc2 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
ce->lrc_desc = desc;
}
+static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
+{
+ u32 *cs;
+ i915_reg_t reg;
+
+ GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
+ GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
+
+ cs = intel_ring_begin(rq, 4);
+ if (IS_ERR(cs))
+ return PTR_ERR(cs);
+
+ if (INTEL_GEN(rq->i915) >= 11)
+ reg = ICL_HDC_MODE;
+ else if (INTEL_GEN(rq->i915) >= 10)
+ reg = CNL_HDC_CHICKEN0;
+ else
+ reg = HDC_CHICKEN0;
+
+ *cs++ = MI_LOAD_REGISTER_IMM(1);
+ *cs++ = i915_mmio_reg_offset(reg);
+ if (enable)
+ *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
+ else
+ *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
+ *cs++ = MI_NOOP;
+
+ intel_ring_advance(rq, cs);
+
+ return 0;
+}
+
+static int
+intel_lr_context_update_data_port_coherency(struct i915_request *rq)
+{
+ struct i915_gem_context *ctx = rq->gem_context;
+ bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
+ int ret;
+
+ lockdep_assert_held(&rq->i915->drm.struct_mutex);
+
+ if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
+ return 0;
+
+ ret = emit_set_data_port_coherency(rq, enable);
+
+ if (!ret) {
+ if (enable)
+ __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+ else
+ __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
+ }
+
+ return ret;
+}
+
static void unwind_wa_tail(struct i915_request *rq)
{
rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
@@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
bool vf_flush_wa = false, dc_flush_wa = false;
u32 *cs, flags = 0;
- int len;
+ int err, len;
flags |= PIPE_CONTROL_CS_STALL;
@@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
/* WaForGAMHang:kbl */
if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
dc_flush_wa = true;
+
+ err = intel_lr_context_update_data_port_coherency(request);
+ if (WARN_ON(err)) {
+ DRM_DEBUG("Data Port Coherency toggle failed.\n");
+ return err;
+ }
}
len = 6;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 298b2e1..7c9e153 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
#define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
#define I915_CONTEXT_DEFAULT_PRIORITY 0
#define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
+/*
+ * When data port level coherency is enabled, the GPU and CPU will both keep
+ * changes to memory content visible to each other as fast as possible, by
+ * forcing internal cache units to send memory writes to higher level caches
+ * immediately after writes. Only buffers with coherency requested within
+ * surface state, or specific stateless accesses will be affected by this
+ * option. Keeping data port coherency has a performance cost, and therefore
+ * it is by default disabled (see WaForceEnableNonCoherent).
+ */
+#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
__u64 value;
};
--
2.7.4
^ permalink raw reply related [flat|nested] 88+ messages in thread
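[Editorial note: the REQUESTED/ACTIVE flag pair in the patch above makes the register write lazy — the LRI is emitted only when the last-emitted hardware state differs from what the context asked for, so back-to-back submissions with the same setting pay nothing extra. A simplified model of that logic, with no locking or error path:]

```c
#include <stdbool.h>

/* requested: what the last setparam asked for.
 * active:    what was last emitted to the ring for this context. */
struct coherency_state {
	bool requested;
	bool active;
};

/* Returns true when an LRI must be emitted; updates 'active' to
 * mirror the patch, which flips CONTEXT_DATA_PORT_COHERENT_ACTIVE
 * only after a successful emit. */
static bool update_needs_emit(struct coherency_state *c)
{
	if (c->active == c->requested)
		return false;
	c->active = c->requested;
	return true;
}
```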
* Re: [PATCH v8] drm/i915: Add IOCTL Param to control data port coherency.
2018-10-12 15:02 ` [PATCH v8] " Tomasz Lis
@ 2018-10-15 12:52 ` Tvrtko Ursulin
2018-10-16 13:59 ` Joonas Lahtinen
1 sibling, 0 replies; 88+ messages in thread
From: Tvrtko Ursulin @ 2018-10-15 12:52 UTC (permalink / raw)
To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski
On 12/10/2018 16:02, Tomasz Lis wrote:
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
>
> Rationale:
>
> The OpenCL driver develpers requested a functionality to control cache
typo in developers
> coherency at data port level. Keeping the coherency at that level is disabled
> by default due to its performance costs. The OpenCL driver plans to
> enable it for a small subset of submissions, when such functionality is
> required. Below are answers to basic questions explaining the background
> of the functionality and the reasoning for the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, a minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of coherency enable/disable
> switch is to remove overhead of memory coherency when memory coherency is not
> needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
> ________________
> | NODE1 |
> | uint64_t data |
> +----------------|
> | NODE* | NODE*|
> +--------+-------+
> / \
> ________________/ \________________
> | NODE2 | | NODE3 |
> | uint64_t data | | uint64_t data |
> +----------------| +----------------|
> | NODE* | NODE*| | NODE* | NODE*|
> +--------+-------+ +--------+-------+
>
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
>
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as the tree-like data structure example), the OCL
> compiler is not able to determine the origin of memory pointed to by an arbitrary
> pointer - i.e. is not able to track given pointer back to a specific
> allocation. As such, it's not able to decide whether coherency is needed or not
> for specific pointer (or for specific I/O instruction). As a result, compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
>
> One such alternative solution is a global coherency switch that allows
> disabling coherency for a single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
>
> 3. Will coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
> Removed redundant GEM_WARN_ON()s. Improved to coding standard.
> Introduced a macro for checking whether hardware supports the feature.
> v5: Renamed some locals. Made the flag write to be lazy.
> Updated comments to remove misconceptions. Added gen11 support.
> v6: Moved the flag write to gen8_emit_flush_render(). Renamed some functions.
> Moved all flags checking to one place. Added mutex check.
> v7: Removed 2 comments, improved API comment. (Joonas)
> v8: Use non-GEM WARN_ON when in statements. (Tvrtko)
>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
>
> Bspec: 11419
> Bspec: 19175
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.h | 1 +
> drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
> drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
> drivers/gpu/drm/i915/intel_lrc.c | 64 ++++++++++++++++++++++++++++++++-
> include/uapi/drm/i915_drm.h | 10 ++++++
> 5 files changed, 116 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 3017ef0..90b3a0ff 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>
> #define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 8cbe580..718ede9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *i915 = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_GTT_SIZE:
> if (ctx->ppgtt)
> args->value = ctx->ppgtt->vm.total;
> - else if (to_i915(dev)->mm.aliasing_ppgtt)
> - args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
> + else if (i915->mm.aliasing_ppgtt)
> + args->value = i915->mm.aliasing_ppgtt->vm.total;
> else
> - args->value = to_i915(dev)->ggtt.vm.total;
> + args->value = i915->ggtt.vm.total;
> break;
> case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
> args->value = i915_gem_context_no_error_capture(ctx);
> @@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_PRIORITY:
> args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
> break;
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (!HAS_DATA_PORT_COHERENCY(i915))
> + ret = -ENODEV;
> + else
> + args->value = i915_gem_context_is_data_port_coherent(ctx);
> + break;
> default:
> ret = -EINVAL;
> break;
> @@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *i915 = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>
> if (args->size)
> ret = -EINVAL;
> - else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> + else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> ret = -ENODEV;
> else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
> priority < I915_CONTEXT_MIN_USER_PRIORITY)
> @@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> }
> break;
>
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (args->size)
> + ret = -EINVAL;
> + else if (!HAS_DATA_PORT_COHERENCY(i915))
> + ret = -ENODEV;
> + else if (args->value == 1)
> + i915_gem_context_set_data_port_coherent(ctx);
> + else if (args->value == 0)
> + i915_gem_context_clear_data_port_coherent(ctx);
> + else
> + ret = -EINVAL;
> + break;
> +
> default:
> ret = -EINVAL;
> break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index f6d870b..69f9247 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -131,6 +131,8 @@ struct i915_gem_context {
> #define CONTEXT_BANNED 0
> #define CONTEXT_CLOSED 1
> #define CONTEXT_FORCE_SINGLE_SUBMISSION 2
> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED 3
> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE 4
>
> /**
> * @hw_id: - unique identifier for the context
> @@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
> atomic_dec(&ctx->hw_id_pin_count);
> }
>
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + __set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
> {
> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ff0e2b3..8680bc2 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
> ce->lrc_desc = desc;
> }
>
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> + u32 *cs;
> + i915_reg_t reg;
> +
> + GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> + GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> + cs = intel_ring_begin(rq, 4);
> + if (IS_ERR(cs))
> + return PTR_ERR(cs);
> +
> + if (INTEL_GEN(rq->i915) >= 11)
> + reg = ICL_HDC_MODE;
> + else if (INTEL_GEN(rq->i915) >= 10)
> + reg = CNL_HDC_CHICKEN0;
> + else
> + reg = HDC_CHICKEN0;
> +
> + *cs++ = MI_LOAD_REGISTER_IMM(1);
> + *cs++ = i915_mmio_reg_offset(reg);
> + if (enable)
> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> + else
> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> + *cs++ = MI_NOOP;
> +
> + intel_ring_advance(rq, cs);
> +
> + return 0;
> +}
> +
> +static int
> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
> +{
> + struct i915_gem_context *ctx = rq->gem_context;
> + bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> + int ret;
> +
> + lockdep_assert_held(&rq->i915->drm.struct_mutex);
> +
> + if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
> + return 0;
> +
> + ret = emit_set_data_port_coherency(rq, enable);
> +
> + if (!ret) {
> + if (enable)
> + __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> + else
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> + }
> +
> + return ret;
> +}
> +
> static void unwind_wa_tail(struct i915_request *rq)
> {
> rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
> @@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
> i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
> bool vf_flush_wa = false, dc_flush_wa = false;
> u32 *cs, flags = 0;
> - int len;
> + int err, len;
>
> flags |= PIPE_CONTROL_CS_STALL;
>
> @@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
> /* WaForGAMHang:kbl */
> if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
> dc_flush_wa = true;
> +
> + err = intel_lr_context_update_data_port_coherency(request);
> + if (WARN_ON(err)) {
> + DRM_DEBUG("Data Port Coherency toggle failed.\n");
> + return err;
> + }
> }
>
> len = 6;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 298b2e1..7c9e153 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
> #define I915_CONTEXT_DEFAULT_PRIORITY 0
> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU and CPU will both keep
> + * changes to memory content visible to each other as fast as possible, by
> + * forcing internal cache units to send memory writes to higher level caches
> + * immediately after writes. Only buffers with coherency requested within
> + * surface state, or specific stateless accesses will be affected by this
> + * option. Keeping data port coherency has a performance cost, and therefore
> + * it is by default disabled (see WaForceEnableNonCoherent).
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
> __u64 value;
> };
>
>
Looks okay to me.
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Regards,
Tvrtko
* Re: [PATCH v8] drm/i915: Add IOCTL Param to control data port coherency.
2018-10-12 15:02 ` [PATCH v8] " Tomasz Lis
2018-10-15 12:52 ` Tvrtko Ursulin
@ 2018-10-16 13:59 ` Joonas Lahtinen
1 sibling, 0 replies; 88+ messages in thread
From: Joonas Lahtinen @ 2018-10-16 13:59 UTC (permalink / raw)
To: Tomasz Lis, intel-gfx; +Cc: bartosz.dunajski
Quoting Tomasz Lis (2018-10-12 18:02:56)
> The patch adds a parameter to control the data port coherency functionality
> on a per-context level. When the IOCTL is called, a command to switch data
> port coherency state is added to the ordered list. All prior requests are
> executed on old coherency settings, and all exec requests after the IOCTL
> will use new settings.
>
> Rationale:
>
> The OpenCL driver developers requested a functionality to control cache
> coherency at the data port level. Coherency at that level is disabled by
> default due to its performance costs. The OpenCL driver plans to enable
> it for a small subset of submissions, when such functionality is
> required. Below are answers to basic questions explaining the background
> of the functionality and the reasoning behind the proposed implementation:
>
> 1. Why do we need a coherency enable/disable switch for memory that is shared
> between CPU and GEN (GPU)?
>
> Memory coherency between CPU and GEN, while being a great feature that enables
> CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
> overhead related to tracking (snooping) memory inside different cache units
> (L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
> applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
> memory coherency between CPU and GPU). The goal of the coherency
> enable/disable switch is to remove the overhead of memory coherency when it
> is not needed.
>
> 2. Why do we need a global coherency switch?
>
> In order to support I/O commands from within EUs (Execution Units), Intel GEN
> ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
> These send instructions provide several addressing models. One of these
> addressing models (named "stateless") provides most flexible I/O using plain
> virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
> model is similar to regular memory load/store operations available on typical
> CPUs. Since this model provides I/O using arbitrary virtual addresses, it
> enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
> of pointers) concepts. For instance, it allows creating tree-like data
> structures such as:
> ________________
> | NODE1 |
> | uint64_t data |
> +----------------|
> | NODE* | NODE*|
> +--------+-------+
> / \
> ________________/ \________________
> | NODE2 | | NODE3 |
> | uint64_t data | | uint64_t data |
> +----------------| +----------------|
> | NODE* | NODE*| | NODE* | NODE*|
> +--------+-------+ +--------+-------+
>
> Please note that pointers inside such structures can point to memory locations
> in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
> allocation while NODE3 resides in a completely separate OCL allocation.
> Additionally, such pointers can be shared with CPU (i.e. using SVM - Shared
> Virtual Memory feature). Using pointers from different allocations doesn't
> affect the stateless addressing model which even allows scattered reading from
> different allocations at the same time (i.e. by utilizing SIMD-nature of send
> instructions).
>
> When it comes to coherency programming, send instructions in stateless model
> can be encoded (at ISA level) to either use or disable coherency. However, for
> generic OCL applications (such as the example with the tree-like data
> structure), the OCL compiler is not able to determine the origin of memory
> pointed to by an arbitrary pointer - i.e. it is not able to track a given
> pointer back to a specific allocation. As such, it cannot decide whether
> coherency is needed for a specific pointer (or for a specific I/O
> instruction). As a result, the compiler
> encodes all stateless sends as coherent (doing otherwise would lead to
> functional issues resulting from data corruption). Please note that it would be
> possible to workaround this (e.g. based on allocations map and pointer bounds
> checking prior to each I/O instruction) but the performance cost of such
> workaround would be many times greater than the cost of keeping coherency
> always enabled. As such, enabling/disabling memory coherency at GEN ISA level
> is not feasible and alternative method is needed.
>
> The alternative solution is a global coherency switch that allows
> disabling coherency for a single (though entire) GPU submission. This is
> beneficial because this way we:
> * can enable (and pay for) coherency only in submissions that actually need
> coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
> * don't care about coherency at GEN ISA granularity (no performance impact)
Might be worth mentioning that this address space compatibility can be
achieved with userptr + soft-pinning allocations to their process space
addresses.
Anyway, this is;
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
But as mentioned previously, getting this merged needs the test to be
finished and clarity from the userspace project side.
Regards, Joonas
> 3. Will coherency switch be used frequently?
>
> There are scenarios that will require frequent toggling of the coherency
> switch.
> E.g. an application has two OCL compute kernels: kern_master and kern_worker.
> kern_master uses, concurrently with CPU, some fine grain SVM resources
> (CL_MEM_SVM_FINE_GRAIN_BUFFER). These resources contain descriptors of
> computational work that needs to be executed. kern_master analyzes incoming
> work descriptors and populates a plain OCL buffer (non-fine-grain) with payload
> for kern_worker. Once kern_master is done, kern_worker kicks in and processes
> the payload that kern_master produced. These two kernels work in a loop, one
> after another. Since only kern_master requires coherency, kern_worker should
> not be forced to pay for it. This means that we need to have the ability to
> toggle coherency switch on or off per each GPU submission:
> (ENABLE COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> (ENABLE
> COHERENCY) kern_master -> (DISABLE COHERENCY)kern_worker -> ...
>
> v2: Fixed compilation warning.
> v3: Refactored the patch to add IOCTL instead of exec flag.
> v4: Renamed and documented the API flag. Used strict values.
> Removed redundant GEM_WARN_ON()s. Improved to coding standard.
> Introduced a macro for checking whether hardware supports the feature.
> v5: Renamed some locals. Made the flag write to be lazy.
> Updated comments to remove misconceptions. Added gen11 support.
> v6: Moved the flag write to gen8_emit_flush_render(). Renamed some functions.
> Moved all flags checking to one place. Added mutex check.
> v7: Removed 2 comments, improved API comment. (Joonas)
> v8: Use non-GEM WARN_ON when in statements. (Tvrtko)
>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michal Winiarski <michal.winiarski@intel.com>
>
> Bspec: 11419
> Bspec: 19175
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.h | 1 +
> drivers/gpu/drm/i915/i915_gem_context.c | 29 ++++++++++++---
> drivers/gpu/drm/i915/i915_gem_context.h | 17 +++++++++
> drivers/gpu/drm/i915/intel_lrc.c | 64 ++++++++++++++++++++++++++++++++-
> include/uapi/drm/i915_drm.h | 10 ++++++
> 5 files changed, 116 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 3017ef0..90b3a0ff 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2588,6 +2588,7 @@ intel_info(const struct drm_i915_private *dev_priv)
> #define HAS_EDRAM(dev_priv) (!!((dev_priv)->edram_cap & EDRAM_ENABLED))
> #define HAS_WT(dev_priv) ((IS_HASWELL(dev_priv) || \
> IS_BROADWELL(dev_priv)) && HAS_EDRAM(dev_priv))
> +#define HAS_DATA_PORT_COHERENCY(dev_priv) (INTEL_GEN(dev_priv) >= 9)
>
> #define HWS_NEEDS_PHYSICAL(dev_priv) ((dev_priv)->info.hws_needs_physical)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 8cbe580..718ede9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -847,6 +847,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *i915 = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -867,10 +868,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_GTT_SIZE:
> if (ctx->ppgtt)
> args->value = ctx->ppgtt->vm.total;
> - else if (to_i915(dev)->mm.aliasing_ppgtt)
> - args->value = to_i915(dev)->mm.aliasing_ppgtt->vm.total;
> + else if (i915->mm.aliasing_ppgtt)
> + args->value = i915->mm.aliasing_ppgtt->vm.total;
> else
> - args->value = to_i915(dev)->ggtt.vm.total;
> + args->value = i915->ggtt.vm.total;
> break;
> case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
> args->value = i915_gem_context_no_error_capture(ctx);
> @@ -881,6 +882,12 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> case I915_CONTEXT_PARAM_PRIORITY:
> args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
> break;
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (!HAS_DATA_PORT_COHERENCY(i915))
> + ret = -ENODEV;
> + else
> + args->value = i915_gem_context_is_data_port_coherent(ctx);
> + break;
> default:
> ret = -EINVAL;
> break;
> @@ -893,6 +900,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file)
> {
> + struct drm_i915_private *i915 = to_i915(dev);
> struct drm_i915_file_private *file_priv = file->driver_priv;
> struct drm_i915_gem_context_param *args = data;
> struct i915_gem_context *ctx;
> @@ -939,7 +947,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>
> if (args->size)
> ret = -EINVAL;
> - else if (!(to_i915(dev)->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> + else if (!(i915->caps.scheduler & I915_SCHEDULER_CAP_PRIORITY))
> ret = -ENODEV;
> else if (priority > I915_CONTEXT_MAX_USER_PRIORITY ||
> priority < I915_CONTEXT_MIN_USER_PRIORITY)
> @@ -953,6 +961,19 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
> }
> break;
>
> + case I915_CONTEXT_PARAM_DATA_PORT_COHERENCY:
> + if (args->size)
> + ret = -EINVAL;
> + else if (!HAS_DATA_PORT_COHERENCY(i915))
> + ret = -ENODEV;
> + else if (args->value == 1)
> + i915_gem_context_set_data_port_coherent(ctx);
> + else if (args->value == 0)
> + i915_gem_context_clear_data_port_coherent(ctx);
> + else
> + ret = -EINVAL;
> + break;
> +
> default:
> ret = -EINVAL;
> break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index f6d870b..69f9247 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -131,6 +131,8 @@ struct i915_gem_context {
> #define CONTEXT_BANNED 0
> #define CONTEXT_CLOSED 1
> #define CONTEXT_FORCE_SINGLE_SUBMISSION 2
> +#define CONTEXT_DATA_PORT_COHERENT_REQUESTED 3
> +#define CONTEXT_DATA_PORT_COHERENT_ACTIVE 4
>
> /**
> * @hw_id: - unique identifier for the context
> @@ -283,6 +285,21 @@ static inline void i915_gem_context_unpin_hw_id(struct i915_gem_context *ctx)
> atomic_dec(&ctx->hw_id_pin_count);
> }
>
> +static inline bool i915_gem_context_is_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + return test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_set_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + __set_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> +static inline void i915_gem_context_clear_data_port_coherent(struct i915_gem_context *ctx)
> +{
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> +}
> +
> static inline bool i915_gem_context_is_default(const struct i915_gem_context *c)
> {
> return c->user_handle == DEFAULT_CONTEXT_HANDLE;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ff0e2b3..8680bc2 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -259,6 +259,62 @@ intel_lr_context_descriptor_update(struct i915_gem_context *ctx,
> ce->lrc_desc = desc;
> }
>
> +static int emit_set_data_port_coherency(struct i915_request *rq, bool enable)
> +{
> + u32 *cs;
> + i915_reg_t reg;
> +
> + GEM_BUG_ON(rq->engine->class != RENDER_CLASS);
> + GEM_BUG_ON(INTEL_GEN(rq->i915) < 9);
> +
> + cs = intel_ring_begin(rq, 4);
> + if (IS_ERR(cs))
> + return PTR_ERR(cs);
> +
> + if (INTEL_GEN(rq->i915) >= 11)
> + reg = ICL_HDC_MODE;
> + else if (INTEL_GEN(rq->i915) >= 10)
> + reg = CNL_HDC_CHICKEN0;
> + else
> + reg = HDC_CHICKEN0;
> +
> + *cs++ = MI_LOAD_REGISTER_IMM(1);
> + *cs++ = i915_mmio_reg_offset(reg);
> + if (enable)
> + *cs++ = _MASKED_BIT_DISABLE(HDC_FORCE_NON_COHERENT);
> + else
> + *cs++ = _MASKED_BIT_ENABLE(HDC_FORCE_NON_COHERENT);
> + *cs++ = MI_NOOP;
> +
> + intel_ring_advance(rq, cs);
> +
> + return 0;
> +}
> +
> +static int
> +intel_lr_context_update_data_port_coherency(struct i915_request *rq)
> +{
> + struct i915_gem_context *ctx = rq->gem_context;
> + bool enable = test_bit(CONTEXT_DATA_PORT_COHERENT_REQUESTED, &ctx->flags);
> + int ret;
> +
> + lockdep_assert_held(&rq->i915->drm.struct_mutex);
> +
> + if (test_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags) == enable)
> + return 0;
> +
> + ret = emit_set_data_port_coherency(rq, enable);
> +
> + if (!ret) {
> + if (enable)
> + __set_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> + else
> + __clear_bit(CONTEXT_DATA_PORT_COHERENT_ACTIVE, &ctx->flags);
> + }
> +
> + return ret;
> +}
> +
> static void unwind_wa_tail(struct i915_request *rq)
> {
> rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
> @@ -1965,7 +2021,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
> i915_ggtt_offset(engine->scratch) + 2 * CACHELINE_BYTES;
> bool vf_flush_wa = false, dc_flush_wa = false;
> u32 *cs, flags = 0;
> - int len;
> + int err, len;
>
> flags |= PIPE_CONTROL_CS_STALL;
>
> @@ -1996,6 +2052,12 @@ static int gen8_emit_flush_render(struct i915_request *request,
> /* WaForGAMHang:kbl */
> if (IS_KBL_REVID(request->i915, 0, KBL_REVID_B0))
> dc_flush_wa = true;
> +
> + err = intel_lr_context_update_data_port_coherency(request);
> + if (WARN_ON(err)) {
> + DRM_DEBUG("Data Port Coherency toggle failed.\n");
> + return err;
> + }
> }
>
> len = 6;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 298b2e1..7c9e153 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1486,6 +1486,16 @@ struct drm_i915_gem_context_param {
> #define I915_CONTEXT_MAX_USER_PRIORITY 1023 /* inclusive */
> #define I915_CONTEXT_DEFAULT_PRIORITY 0
> #define I915_CONTEXT_MIN_USER_PRIORITY -1023 /* inclusive */
> +/*
> + * When data port level coherency is enabled, the GPU and CPU will both keep
> + * changes to memory content visible to each other as fast as possible, by
> + * forcing internal cache units to send memory writes to higher level caches
> + * immediately after writes. Only buffers with coherency requested within
> + * surface state, or specific stateless accesses will be affected by this
> + * option. Keeping data port coherency has a performance cost, and therefore
> + * it is by default disabled (see WaForceEnableNonCoherent).
> + */
> +#define I915_CONTEXT_PARAM_DATA_PORT_COHERENCY 0x7
> __u64 value;
> };
>
> --
> 2.7.4
>