* [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Julien Grall @ 2017-12-05 18:39 UTC (permalink / raw)
  To: xen-devel, Jan Beulich, Andrew Cooper, George Dunlap,
	Stefano Stabellini, Andre Przywara, Tim Deegan

Hi all,

Even though this is an Arm-specific failure, I have CCed the x86 folks 
to get feedback on the approach. I have a WIP branch I could share if 
that interests people.

A few months ago, we noticed a heisenbug on jobs run by osstest on the 
cubietrucks (see [1]). From the logs, we figured out that guest vCPU 0 
is in a data/prefetch abort state at early boot. I have been able to 
reproduce it reliably, and from the little information I have, I think 
it is related to a cache issue, because we don't trap cache 
maintenance instructions by set/way.

This is a set of three instructions (clean, clean & invalidate, 
invalidate) working on a given cache level by set/way (S/W). Because 
the OS is not allowed to infer the S/W to PA mapping, it can only use 
S/W to nuke the whole cache. "The expected usage of the cache 
maintenance that operate by set/way is associated with powerdown and 
powerup of caches, if this is required by the implementation" (see 
D3-2020 ARM DDI 0487B.b).

These instructions target the local processor and are usually issued 
in a batch to nuke the whole cache. This means that if the vCPU is 
migrated to another pCPU in the middle of the sequence, the cache may 
not be fully cleaned. This would result in data corruption and a 
potential crash of the OS.
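
To illustrate the batch usage, here is roughly what such a full clean 
looks like (a sketch only; the helper names are made up, cf. the arm32 
v7_flush_dcache_all routine):

/* Sketch: how an OS nukes the whole D-cache by set/way. */
static void flush_dcache_all_by_set_way(void)
{
    unsigned int level, set, way;

    for ( level = 0; level < num_cache_levels(); level++ )
        for ( set = 0; set < num_sets(level); set++ )
            for ( way = 0; way < num_ways(level); way++ )
                /* DC CISW: clean & invalidate one set/way of one cache
                 * level, on the *local* CPU only. Migrating the vCPU in
                 * the middle of this loop leaves part of the old pCPU's
                 * cache unmaintained. */
                dc_cisw(encode_set_way(level, set, way));
}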

Thankfully, the Arm architecture offers a way to trap all the cache 
maintenance instructions by S/W (HCR_EL2.TSW). Xen will need to set 
that bit and handle S/W itself.

The major question now is how to handle them. S/W instructions are 
difficult to virtualize (see ARMv7 ARM B1.14.4).

The suggested policy is based on the KVM one:
	- If we trap an S/W instruction, we enable VM register trapping 
(HCR_EL2.TVM) to detect the cache being turned on/off, and do a full clean.
	- We flush the caches both when the caches are turned on and when 
they are turned off.
	- Once the caches are enabled, we stop trapping VM register accesses.
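
A rough sketch of that flow (the helper and flag names below are 
hypothetical, not existing Xen code):

/* Trapped DC ISW/CSW/CISW (HCR_EL2.TSW is set). */
void vcpu_handle_sw_cache_op(struct vcpu *v)
{
    if ( !v->arch.trapping_vm_regs )
    {
        /* Start watching SCTLR writes to spot the cache on/off switch. */
        WRITE_SYSREG(READ_SYSREG(HCR_EL2) | HCR_TVM, HCR_EL2);
        v->arch.trapping_vm_regs = true;
    }
    p2m_full_clean(v->domain);     /* preemptible full clean, see below */
}

/* Trapped write to SCTLR while HCR_EL2.TVM is set. */
void vcpu_handle_sctlr_write(struct vcpu *v, register_t val)
{
    p2m_full_clean(v->domain);     /* flush on both enable and disable */
    if ( val & SCTLR_C )           /* caches are now on: stop trapping */
    {
        WRITE_SYSREG(READ_SYSREG(HCR_EL2) & ~HCR_TVM, HCR_EL2);
        v->arch.trapping_vm_regs = false;
    }
}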

Doing a full clean will require going through the P2M and flushing the 
entries one by one. At the moment, all of the guest memory is mapped. 
As you can imagine, flushing a guest with hundreds of MB of RAM will 
take a very long time (Linux times out during CPU bring-up).

Therefore, we need a way to limit the number of entries we need to 
flush. The suggested solution here is to introduce Populate On Demand 
(PoD) on Arm.

The guest would boot with no RAM mapped in the stage-2 page-tables. On 
every prefetch/data abort, the RAM would be mapped, preferably in 2MB 
chunks, otherwise 4KB. This means that, by the time S/W is used, the 
number of entries mapped would be very limited. However, for safety, 
the flush should be preemptible.
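
Roughly (again a sketch, with made-up helper names):

/* Stage-2 abort handler: demand-populate guest RAM. */
bool p2m_pod_handle_fault(struct domain *d, paddr_t gpa)
{
    gfn_t gfn = gaddr_to_gfn(gpa);

    if ( !gfn_is_guest_ram(d, gfn) )
        return false;                   /* not RAM: a genuine abort */

    /* Prefer a 2MB superpage mapping; fall back to a 4KB page. */
    if ( !pod_populate_2m(d, gfn_align_2m(gfn)) )
        pod_populate_4k(d, gfn);

    return true;                        /* retry the faulting access */
}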

For those worried about the performance impact, I have looked at the 
current use of S/W instructions:
	- Linux arm64: the last use in the kernel dates from the beginning 
of 2015.
	- Linux arm32: still uses S/W for boot and secondary CPU bring-up. 
No plan to change.
	- UEFI: a couple of uses, but I have heard they plan to remove them 
(needs confirmation).

I haven't looked at all the OSes. However, given that the Arm Arm 
clearly states that S/W instructions are not easily virtualizable, I 
would expect guest OS developers to try their best to limit the use of 
these instructions.

To limit the performance impact, we could introduce a guest option to 
tell whether the guest will use S/W. If it does plan to use S/W, PoD 
will be disabled.

Now regarding the hardware domain: at the moment, its RAM is direct 
mapped. Supporting direct mapping in PoD would be quite a pain for a 
limited benefit (see above). In that case I would suggest imposing 
vCPU pinning for the hardware domain if S/W instructions are expected 
to be used. Again, a command line option could be introduced here.

Any feedback on the approach is welcome.

Cheers,

[1] 
https://lists.xenproject.org/archives/html/xen-devel/2017-09/msg03191.html

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Stefano Stabellini @ 2017-12-05 22:35 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Jan Beulich, Andrew Cooper, xen-devel

On Tue, 5 Dec 2017, Julien Grall wrote:
> [...]
> For those worried about the performance impact, I have looked at the
> current use of S/W instructions:
> 	- Linux arm64: the last use in the kernel dates from the beginning of
> 2015.
> 	- Linux arm32: still uses S/W for boot and secondary CPU bring-up. No
> plan to change.
> 	- UEFI: a couple of uses, but I have heard they plan to remove them
> (needs confirmation).
> [...]
> Now regarding the hardware domain: at the moment, its RAM is direct
> mapped. Supporting direct mapping in PoD would be quite a pain for a
> limited benefit (see above). In that case I would suggest imposing vCPU
> pinning for the hardware domain if S/W instructions are expected to be
> used. Again, a command line option could be introduced here.
> 
> Any feedback on the approach is welcome.
 
Could we pin the hwdom vCPUs only at boot time, until all S/W operations
have been issued, and then "release" them, if we can detect the last
expected S/W operation with some sort of heuristic?

Given the information provided above, would it make sense to consider
avoiding PoD for arm64 kernel direct boots?


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Julien Grall @ 2017-12-05 22:54 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: George Dunlap, Andre Przywara, Tim Deegan, Jan Beulich,
	Andrew Cooper, xen-devel



On 05/12/2017 22:35, Stefano Stabellini wrote:
> On Tue, 5 Dec 2017, Julien Grall wrote:
>> [...]
>   
> Could we pin the hwdom vCPUs only at boot time, until all S/W operations
> are issued, and then "release" them, if we can detect the last expected
> S/W operation with some sort of heuristic?

Feel free to suggest a way. I haven't found one. But to be honest, you 
have seen how much people care about a 32-bit hwdom today. So I would 
not spend too much time thinking about optimizing it.

> 
> Given the information provided above, would it make sense to consider
> avoiding PoD for arm64 kernel direct boots?

Please suggest a way to tell that an arm64 kernel direct boot will not 
be using S/W. I don't see any.

The only solution I can see is to provide a configuration option at 
boot time, as I suggested a bit above:

"To limit the performance impact, we could introduce a guest option to 
tell whether the guest will use S/W. If it does plan to use S/W, PoD 
will be disabled."

But at this stage, my concern is fixing a blatant bug in Xen; 
performance is a second step.

Cheers,

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Jan Beulich @ 2017-12-06  9:15 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Andrew Cooper, xen-devel

>>> On 05.12.17 at 19:39, <julien.grall@linaro.org> wrote:
> The suggested policy is based on the KVM one:
> 	- If we trap an S/W instruction, we enable VM register trapping 
> (HCR_EL2.TVM) to detect the cache being turned on/off, and do a full clean.
> 	- We flush the caches both when the caches are turned on and when 
> they are turned off.
> 	- Once the caches are enabled, we stop trapping VM register accesses.
> 
> Doing a full clean will require going through the P2M and flushing the 
> entries one by one. At the moment, all of the guest memory is mapped. As 
> you can imagine, flushing a guest with hundreds of MB of RAM will take a 
> very long time (Linux times out during CPU bring-up).
> 
> Therefore, we need a way to limit the number of entries we need to 
> flush. The suggested solution here is to introduce Populate On Demand 
> (PoD) on Arm.
> 
> The guest would boot with no RAM mapped in the stage-2 page-tables. On 
> every prefetch/data abort, the RAM would be mapped, preferably in 2MB 
> chunks, otherwise 4KB. This means that, by the time S/W is used, the 
> number of entries mapped would be very limited. However, for safety, 
> the flush should be preemptible.

For my own understanding: here you suggest using PoD in order
to deal with S/W insn interception.

> To limit the performance impact, we could introduce a guest option to 
> tell whether the guest will use S/W. If it does plan to use S/W, PoD 
> will be disabled.

Therefore I'm wondering if here you mean "If it doesn't plan to ..."

Independent of this I'm pretty unclear about your conclusion that
there will be only a very limited number of P2M entries at the time
S/W insns would be used by the guest. Are you ignoring potentially
malicious guests for the moment? Otoh you admit that things would
need to be preemptible, so perhaps the argumentation is that you
simply expect well-behaved guests to only have such a limited amount
of P2M entries.

Am I, btw, understanding correctly that other than on x86 you
intend PoD to not be used for maxmem > memory scenarios, at
least for the time being?

Jan



* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Julien Grall @ 2017-12-06 12:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Andrew Cooper, xen-devel

Hi Jan,

On 12/06/2017 09:15 AM, Jan Beulich wrote:
>>>> On 05.12.17 at 19:39, <julien.grall@linaro.org> wrote:
>> [...]
> 
> For my own understanding: here you suggest using PoD in order
> to deal with S/W insn interception.

That's right. PoD would limit the number of entries to flush.

> 
>> To limit the performance impact, we could introduce a guest option to
>> tell whether the guest will use S/W. If it does plan to use S/W, PoD
>> will be disabled.
> 
> Therefore I'm wondering if here you mean "If it doesn't plan to ..."

Whoops. I meant "If it doesn't plan".

> 
> Independent of this I'm pretty unclear about your conclusion that
> there will be only a very limited number of P2M entries at the time
> S/W insns would be used by the guest. Are you ignoring potentially
> malicious guests for the moment? Otoh you admit that things would
> need to be preemptible, so perhaps the argumentation is that you
> simply expect well-behaved guests to only have such a limited amount
> of P2M entries.

The preemption is there to cover malicious guests, and potentially 
well-behaved guest use cases I have missed. But TBH, the latter would 
be a call for the OS to be reworked, as fast emulation of S/W will be 
really difficult.

> 
> Am I, btw, understanding correctly that other than on x86 you
> intend PoD to not be used for maxmem > memory scenarios, at
> least for the time being?

Yes. I don't think it would be difficult to add that support for Arm as 
well.

Also, at the moment, the PoD code is nearly a verbatim copy of the x86 
version, and this is only because of its interface with the rest of the 
p2m code. I am planning to discuss on the ML the possibility of sharing 
the PoD code.

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: George Dunlap @ 2017-12-06 12:28 UTC (permalink / raw)
  To: Julien Grall, xen-devel, Jan Beulich, Andrew Cooper,
	George Dunlap, Stefano Stabellini, Andre Przywara, Tim Deegan

On 12/05/2017 06:39 PM, Julien Grall wrote:
> [...]
> These instructions target the local processor and are usually issued in
> a batch to nuke the whole cache. This means that if the vCPU is migrated
> to another pCPU in the middle of the sequence, the cache may not be fully
> cleaned. This would result in data corruption and a potential crash of
> the OS.

I don't quite understand the failure mode here: Why does vCPU migration
cause cache inconsistency in the middle of one of these "cleans", but
not under normal operation?

> [...]
> Now regarding the hardware domain: at the moment, its RAM is direct
> mapped. Supporting direct mapping in PoD would be quite a pain for a
> limited benefit (see above). In that case I would suggest imposing vCPU
> pinning for the hardware domain if S/W instructions are expected to be
> used. Again, a command line option could be introduced here.
> 
> Any feedback on the approach is welcome.

I still don't entirely understand the underlying failure mode, but there
are a couple of things we could consider:

1. Automatically disabling 'vcpu migration' when caching is turned off.
This wouldn't prevent a vcpu from being preempted, just from being run
somewhere else.

2. It sounds like rather than using PoD, you could use the
"misconfigured p2m table" technique that x86 uses: set bits in the p2m
entry which cause a specific kind of HAP fault when accessed.  The fault
handler then looks in the p2m entry, and if it finds an otherwise valid
entry, it just fixes the "misconfigured" bits and continues.
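
Something like this (a sketch; the names are illustrative, not the
actual x86 code):

/* HAP fault handler path for a "misconfigured" entry. */
int p2m_fix_misconfig(struct p2m_domain *p2m, gfn_t gfn)
{
    pte_t *entry = p2m_lookup_entry(p2m, gfn);

    if ( !pte_is_misconfigured(*entry) )
        return -EINVAL;             /* a real fault, handled elsewhere */

    /* The entry is otherwise valid: repair the bits and resume. */
    pte_clear_misconfig(entry);
    p2m_flush_tlb_entry(p2m, gfn);
    return 0;
}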

 -George


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Julien Grall @ 2017-12-06 12:58 UTC (permalink / raw)
  To: George Dunlap, xen-devel, Jan Beulich, Andrew Cooper,
	George Dunlap, Stefano Stabellini, Andre Przywara, Tim Deegan

Hi George,

On 12/06/2017 12:28 PM, George Dunlap wrote:
> On 12/05/2017 06:39 PM, Julien Grall wrote:
>> [...]
>> These instructions target the local processor and are usually issued in
>> a batch to nuke the whole cache. This means that if the vCPU is migrated
>> to another pCPU in the middle of the sequence, the cache may not be fully
>> cleaned. This would result in data corruption and a potential crash of
>> the OS.
> 
> I don't quite understand the failure mode here: Why does vCPU migration
> cause cache inconsistency in the middle of one of these "cleans", but
> not under normal operation?

Because they target a specific cache level by set/way, whereas the 
other cache operations work on VAs.

To make it short, the VA cache instructions work to the Point of 
Coherency/Point of Unification and guarantee that the caches will be 
consistent. For more details see B2.2.6 in ARM DDI 0406C.c.
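
For illustration (AArch64 mnemonics; the wrappers are made up):

/* By VA: clean one line, by address, to the Point of Coherency. The
 * effect is broadcast to the other CPUs in the shareability domain. */
static inline void clean_dcache_line_poc(const void *p)
{
    asm volatile ("dc cvac, %0" :: "r" (p) : "memory");
}

/* By set/way: clean one line of one level of the *local* cache only.
 * No VA, no broadcast: the OS has to loop over every set/way itself. */
static inline void clean_dcache_line_set_way(unsigned long setway)
{
    asm volatile ("dc csw, %0" :: "r" (setway) : "memory");
}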

> 
>> [...]
> 
> I still don't entirely understand the underlying failure mode, but there
> are a couple of things we could consider:
> 
> 1. Automatically disabling 'vcpu migration' when caching is turned off.
> This wouldn't prevent a vcpu from being preempted, just from being run
> somewhere else.

This suggests the guest will directly perform S/W, right? So you leave 
the guest the possibility to flush all the caches its vCPU can access. 
This is an easy way for the guest to affect the cache entries of other 
guests.

I think this would help some potential data attacks.

> 
> 2. It sounds like rather than using PoD, you could use the
> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
> entry which cause a specific kind of HAP fault when accessed.  The fault
> handler then looks in the p2m entry, and if it finds an otherwise valid
> entry, it just fixes the "misconfigured" bits and continues.

I thought about this. But when do you set the entry to misconfigured?

If you take the example of 32-bit Linux, there are a couple of full 
cache cleans during a uni-processor boot. So you would need to go 
through the p2m multiple times and reset the access bits.

Cheers,

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Julien Grall @ 2017-12-06 13:01 UTC (permalink / raw)
  To: George Dunlap, xen-devel, Jan Beulich, Andrew Cooper,
	George Dunlap, Stefano Stabellini, Andre Przywara, Tim Deegan



On 12/06/2017 12:58 PM, Julien Grall wrote:
> Hi George,
> 
> On 12/06/2017 12:28 PM, George Dunlap wrote:
>> [...]
>> 2. It sounds like rather than using PoD, you could use the
>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>> entry which cause a specific kind of HAP fault when accessed.  The fault
>> handler then looks in the p2m entry, and if it finds an otherwise valid
>> entry, it just fixes the "misconfigured" bits and continues.
> 
> I thought about this. But when do you set the entry to misconfigured?
> 
> If you take the example of 32-bit Linux, there are a couple of full
> cache cleans during a uni-processor boot. So you would need to go
> through the p2m multiple times and reset the access bits.

To complete my answer here: I agree that using PoD to emulate S/W is 
not great. But after looking at all the other solutions, it was the 
only one that could provide better isolation of the guests and some 
decent performance.

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Konrad Rzeszutek Wilk @ 2017-12-06 15:10 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Jan Beulich, Andrew Cooper, xen-devel

.snip..
> The suggested policy is based on the KVM one:
> 	- If we trap an S/W instruction, we enable VM register trapping
> (HCR_EL2.TVM) to detect the cache being turned on/off, and do a full clean.
> 	- We flush the caches both when the caches are turned on and when they
> are turned off.
> 	- Once the caches are enabled, we stop trapping VM register accesses.
> 
> Doing a full clean will require going through the P2M and flushing the
> entries one by one. At the moment, all of the guest memory is mapped. As
> you can imagine, flushing a guest with hundreds of MB of RAM will take a
> very long time (Linux times out during CPU bring-up).

Yikes. Since you mention 'based on the KVM one' - did they solve this particular
problem or do they also have the same issue?


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Jan Beulich @ 2017-12-06 15:15 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	George Dunlap, Andrew Cooper, xen-devel

>>> On 06.12.17 at 13:58, <julien.grall@linaro.org> wrote:
> On 12/06/2017 12:28 PM, George Dunlap wrote:
>> 2. It sounds like rather than using PoD, you could use the
>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>> entry which cause a specific kind of HAP fault when accessed.  The fault
>> handler then looks in the p2m entry, and if it finds an otherwise valid
>> entry, it just fixes the "misconfigured" bits and continues.
> 
> I thought about this. But when do you set the entry to misconfigured?

What we do in x86 is that we flag all entries at the top level as
misconfigured at any time where otherwise we would have to
walk the full tree. Upon access, the misconfigured flag is being
propagated down the page table hierarchy, with only the
intermediate and leaf entries needed for the current access
becoming properly configured again. In your case, as long as
only a limited set of leaf entries are being touched before any
S/W emulation is needed, you'd be able to skip all misconfigured
entries in your traversal, just like with PoD you'd skip
unpopulated ones.
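
In (heavily simplified, illustrative) code, the idea is:

/* Invalidate cheaply: touch only the root table entries. */
void p2m_mark_all_misconfigured(struct p2m_domain *p2m)
{
    for ( unsigned int i = 0; i < ROOT_ENTRIES; i++ )
        pte_set_misconfigured(&p2m->root[i]);
}

/* On a fault, reconfigure only the path used by this access. */
void p2m_resolve_misconfig(struct p2m_domain *p2m, gfn_t gfn)
{
    for ( unsigned int lvl = 0; lvl < NR_LEVELS; lvl++ )
    {
        pte_t *e = p2m_entry_at(p2m, gfn, lvl);

        if ( !pte_is_misconfigured(*e) )
            continue;
        pte_clear_misconfig(e);              /* fix this level... */
        if ( lvl < NR_LEVELS - 1 )
            mark_children_misconfigured(e);  /* ...defer the rest */
    }
}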

> If you take the example of 32-bit Linux, there are a couple of full 
> cache cleans during a uni-processor boot. So you would need to go 
> through the p2m multiple times and reset the access bits.

The proposed mechanism isn't really similar to traditional accessed
bit handling. If there is no other use for the accessed bit (assuming
there is one in ARM PTEs in the first place), and as long as the bit
being clear gives you some sort of signal (on x86 this and the dirty
bit are being updated by hardware, as kind of a side effect of a
page table walk), it could of course be used for the purpose here.

Jan



* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Julien Grall @ 2017-12-06 15:19 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Jan Beulich, Andrew Cooper, xen-devel

Hi Konrad,

On 12/06/2017 03:10 PM, Konrad Rzeszutek Wilk wrote:
> .snip..
>> The suggested policy is based on the KVM one:
>> 	- If we trap an S/W instruction, we enable VM register trapping
>> (HCR_EL2.TVM) to detect the cache being turned on/off, and do a full clean.
>> 	- We flush the caches both when the caches are turned on and when they
>> are turned off.
>> 	- Once the caches are enabled, we stop trapping VM register accesses.
>>
>> Doing a full clean will require going through the P2M and flushing the
>> entries one by one. At the moment, all of the guest memory is mapped. As
>> you can imagine, flushing a guest with hundreds of MB of RAM will take a
>> very long time (Linux times out during CPU bring-up).
> 
> Yikes. Since you mention 'based on the KVM one' - did they solve this particular
> problem or do they also have the same issue?

KVM is using populate on demand by default.

Cheers,

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: George Dunlap @ 2017-12-06 15:24 UTC (permalink / raw)
  To: Julien Grall, Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Jan Beulich, Andrew Cooper, xen-devel

On 12/06/2017 03:19 PM, Julien Grall wrote:
> Hi Konrad,
> 
> On 12/06/2017 03:10 PM, Konrad Rzeszutek Wilk wrote:
>> .snip..
>>> [...]
>>
>> Yikes. Since you mention 'based on the KVM one' - did they solve this
>> particular
>> problem or do they also have the same issue?
> 
> KVM is using populate on demand by default.

If I understand properly, it's probably more accurate to say that KVM
uses "allocate on demand".  The complicated part of populate-on-demand
is the fact that it's not allowed to allocate anything.

 -George


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Julien Grall @ 2017-12-06 15:26 UTC (permalink / raw)
  To: George Dunlap, Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Jan Beulich, Andrew Cooper, xen-devel



On 12/06/2017 03:24 PM, George Dunlap wrote:
> On 12/06/2017 03:19 PM, Julien Grall wrote:
>> [...]
>> KVM is using populate on demand by default.
> 
> If I understand properly, it's probably more accurate to say that KVM
> uses "allocate on demand".  The complicated part of populate-on-demand
> is the fact that it's not allowed to allocate anything.

Hmmm yes. You are right on the wording.

Cheers,

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: George Dunlap @ 2017-12-06 17:49 UTC (permalink / raw)
  To: Julien Grall, xen-devel, Jan Beulich, Andrew Cooper,
	George Dunlap, Stefano Stabellini, Andre Przywara, Tim Deegan

On 12/06/2017 12:58 PM, Julien Grall wrote:
> Hi George,
> 
> On 12/06/2017 12:28 PM, George Dunlap wrote:
>> On 12/05/2017 06:39 PM, Julien Grall wrote:
>>> [...]
>>> These instructions target the local processor and are usually issued in
>>> a batch to nuke the whole cache. This means that if the vCPU is migrated
>>> to another pCPU in the middle of the sequence, the cache may not be fully
>>> cleaned. This would result in data corruption and a potential crash of
>>> the OS.
>>
>> I don't quite understand the failure mode here: Why does vCPU migration
>> cause cache inconsistency in the middle of one of these "cleans", but
>> not under normal operation?
> 
> Because they target a specific cache level by set/way, whereas the
> other cache operations work on VAs.
> 
> To make it short, the VA cache instructions work to the Point of
> Coherency/Point of Unification and guarantee that the caches will be
> consistent. For more details see B2.2.6 in ARM DDI 0406C.c.

I skimmed that section, and I'm not much the wiser.

Just to be clear, this is my question.

Suppose we have the following sequence of events (where vN[pM] means
vcpu N running on pcpu M):

Start with A == 0

1. v0[p1] Read A
  p1 has 'A==0' in the cache
2. scheduler migrates v0 to p0
3. v0[p0] A=2
  p0 has 'A==2' in the cache
4. scheduler migrates v0 to p1
5. v0[p1] Read A

Now, I presume that with the guest not doing anything, the Read of A at
#5 will end up as '2'; i.e., behind the scenes somewhere, either by Xen
or by the hardware, between #1 and #5, p0's version of A gets "cleaned"
and p1's version of A gets "invalidated" (to use the terminology from
the section mentioned above).

So my question is, how does *adding* cache flushing of any sort end up
violating the integrity in a situation like the above?

>>> [...]
>>> Now regarding the hardware domain: at the moment, its RAM is direct
>>> mapped. Supporting direct mapping in PoD would be quite a pain for a
>>> limited benefit (see above). In that case I would suggest imposing
>>> vCPU pinning for the hardware domain if S/W instructions are expected
>>> to be used. Again, a command line option could be introduced here.
>>>
>>> Any feedback on the approach is welcome.
>>
>> I still don't entirely understand the underlying failure mode, but there
>> are a couple of things we could consider:
>>
>> 1. Automatically disabling 'vcpu migration' when caching is turned off.
>> This wouldn't prevent a vcpu from being preempted, just from being run
>> somewhere else.
> 
> This suggests the guest will directly perform S/W, right? So you leave
> the guest the possibility to flush all the caches its vCPU can access.
> This is an easy way for the guest to affect the cache entries of other
> guests.
> 
> I think this would help some potential data attacks.

Well, it's the equivalent of your "imposing vcpu pinning" solution
above, but only temporary.  Was that suggestion meant to allow the
hardware domain to directly perform S/W?

>> 2. It sounds like rather than using PoD, you could use the
>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>> entry which cause a specific kind of HAP fault when accessed.  The fault
>> handler then looks in the p2m entry, and if it finds an otherwise valid
>> entry, it just fixes the "misconfigured" bits and continues.
> 
> I thought about this. But when do you set the entry to misconfigured?
> 
> If you take the example of 32-bit Linux, there are a couple of full
> cache cleans during a uni-processor boot. So you would need to go
> through the p2m multiple times and reset the access bits.

Do you want to reset the p2m multiple times?  I thought the goal was
simply to keep the amount of p2m space you need to flush to a minimum;
if you expect the memory which has been faulted in since the *last* flush
to be relatively small, you could just always flush all memory that had
been touched to that point.

If you *do* need to go through the p2m multiple times, then
misconfiguration is a much better option than PoD.  In PoD, once a page
has data on it, it can't be removed from the p2m anymore.  For the
misconfiguration technique, you can go through and misconfigure the
entries in the top-level p2m table as many times as you want.  The whole
reason for doing it on x86 is that it's a relatively lightweight
operation: we use it to modify MMIO mappings, to enable or disable
logdirty for migrate, &c.

(This of course depends on being able to effectively misconfigure
top-level entries of the p2m on ARM.)

 -George


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Julien Grall @ 2017-12-06 17:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	George Dunlap, Andrew Cooper, xen-devel

Hi Jan,

On 12/06/2017 03:15 PM, Jan Beulich wrote:
>>>> On 06.12.17 at 13:58, <julien.grall@linaro.org> wrote:
>> On 12/06/2017 12:28 PM, George Dunlap wrote:
>>> 2. It sounds like rather than using PoD, you could use the
>>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>>> entry which cause a specific kind of HAP fault when accessed.  The fault
>>> handler then looks in the p2m entry, and if it finds an otherwise valid
>>> entry, it just fixes the "misconfigured" bits and continues.
>>
>> I thought about this. But when do you set the entry to misconfigured?
> 
> What we do in x86 is that we flag all entries at the top level as
> misconfigured at any time where otherwise we would have to
> walk the full tree. Upon access, the misconfigured flag is being
> propagated down the page table hierarchy, with only the
> intermediate and leaf entries needed for the current access
> becoming properly configured again. In your case, as long as
> only a limited set of leaf entries are being touched before any
> S/W emulation is needed, you'd be able to skip all misconfigured
> entries in your traversal, just like with PoD you'd skip
> unpopulated ones.

Oh, what you call "misconfigured bits" would be clearing the valid bit 
of an entry on Arm. The entry would be considered invalid, but it is 
still possible to store information in it (the rest of the bits are 
ignored by the hardware).

But I think this brings another class of problem: when a misconfigured 
entry is accessed, we would need to clean & invalidate the cache for 
that region.

At the moment, Xen only supports 4KB page granularity, so a region 
would be either 4KB, 2MB or 1GB. Flushing a 2MB or 1GB region will 
take time, because you can only clean & invalidate one cache line at a 
time. On Arm, the cache line size can range from 16 bytes to 2048 
bytes. This means we may want to preempt even in the middle of a 
region to avoid blocking Xen for too long.
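
I.e. something along these lines (a sketch; names made up):

/* Clean & invalidate a region by VA, one cache line at a time. */
int clean_inval_region(vaddr_t start, size_t size, size_t *progress)
{
    size_t line = dcache_line_bytes();        /* 16 to 2048 on Arm */
    vaddr_t va;

    for ( va = start + *progress; va < start + size; va += line )
    {
        asm volatile ("dc civac, %0" :: "r" (va) : "memory");

        /* Every 64KB, check whether we should give back the CPU. */
        if ( !((va - start) & 0xffff) &&
             softirq_pending(smp_processor_id()) )
        {
            *progress = va - start;
            return -ERESTART;                 /* caller restarts later */
        }
    }
    dsb(sy);
    return 0;
}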

I think we need to clean & invalidate the region at least in the 
following places:
	1) when the guest is reading/writing the region;
	2) when Xen is accessing the region because of a hypercall.

I will leave 1) aside, as I think the reason for the clean & 
invalidate, as well as how to preempt, is clear to everyone.

For 2), if we access a "misconfigured" page, we would need to clean it 
to avoid stale data. I think in that case preemption would be 
difficult: we would need to modify all the hypercalls to report back 
the preemption and restart again.

On a side note, soon we will need to support 64KB page granularity, 
because this is the only way to handle 52-bit PAs on Arm. In that 
case, regions would be 64KB, 512MB or 4TB. Whether we will support 4TB 
is not decided, but I think 512MB should be.

Cheers,

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
From: Jan Beulich @ 2017-12-07  9:39 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	George Dunlap, Andrew Cooper, xen-devel

>>> On 06.12.17 at 18:52, <julien.grall@linaro.org> wrote:
> On 12/06/2017 03:15 PM, Jan Beulich wrote:
>> [...]
> 
> Oh, what you call "misconfigured bits" would be clearing the valid bit 
> of an entry on Arm. The entry would be considered invalid, but it is 
> still possible to store information in it (the rest of the bits are 
> ignored by the hardware).

Well, on x86 we don't always have a separate "valid" bit, hence
we set something else to a value which will cause a suitable VM
exit when being accessed by the guest.

> But I think this brings another class of problem: when a misconfigured 
> entry is accessed, we would need to clean & invalidate the cache 
> for that region.

Why? (Please remember that I'm an x86 person, so may simply
not be aware of extra constraints ARM has.) The data in the
cache (if any) doesn't change while the mapping is invalid (unless
Xen modifies it, but if there was a coherency problem between
Xen and guest accesses, you'd have the issue with hypercalls
which you describe later independent of the approach suggested
here).

Jan



* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-06 17:49     ` George Dunlap
@ 2017-12-07 13:52       ` Julien Grall
  2017-12-07 14:25         ` Jan Beulich
  2017-12-07 14:53         ` Marc Zyngier
  0 siblings, 2 replies; 41+ messages in thread
From: Julien Grall @ 2017-12-07 13:52 UTC (permalink / raw)
  To: George Dunlap, xen-devel, Jan Beulich, Andrew Cooper,
	George Dunlap, Stefano Stabellini, Andre Przywara, Tim Deegan
  Cc: Marc Zyngier

(+ Marc)

Hi,

@Marc: My Arm cache knowledge is somewhat limited. Feel free to correct 
me if I am wrong.

Before answering the rest of the e-mail, let me reinforce what I said 
in my first e-mail. Set/Way are very complex to emulate and an OS using 
them should never expect good performance in a virtualization context. 
The difficulty is clearly spelled out in the Arm Arm.

So the main goal here is to work around such software.

On 06/12/17 17:49, George Dunlap wrote:
> On 12/06/2017 12:58 PM, Julien Grall wrote:
>> Hi George,
>>
>> On 12/06/2017 12:28 PM, George Dunlap wrote:
>>> On 12/05/2017 06:39 PM, Julien Grall wrote:
>>>> Hi all,
>>>>
>>>> Even though it is an Arm failure, I have CCed x86 folks to get feedback
>>>> on the approach. I have a WIP branch I could share if that interest
>>>> people.
>>>>
>>>> Few months ago, we noticed an heisenbug on jobs run by osstest on the
>>>> cubietrucks (see [1]). From the log, we figured out that the guest vCPU
>>>> 0 is in data/prefetch abort state at early boot. I have been able to
>>>> reproduce it reliably, although from the little information I have I
>>>> think it is related to a cache issue because we don't trap cache
>>>> maintenance instructions by set/way.
>>>>
>>>> This is a set of 3 instructions (clean, clean & invalidate, invalidate)
>>>> working on a given cache level by S/W. Because the OS is not allowed to
>>>> infer the S/W to PA mapping, it can only use S/W to nuke the whole
>>>> cache. "The expected usage of the cache maintenance that operate by
>>>> set/way is associated with powerdown and powerup of caches, if this is
>>>> required by the implementation" (see D3-2020 ARM DDI 0487B.b).
>>>>
>>>> Those instructions will target a local processor and usually working in
>>>> batch for nuking the cache. This means if the vCPU is migrated to
>>>> another pCPU in the middle of the process, the cache may not be cleaned.
>>>> This would result to data corruption and potential crash of the OS.
>>>
>>> I don't quite understand the failure mode here: Why does vCPU migration
>>> cause cache inconsistency in the middle of one of these "cleans", but
>>> not under normal operation?
>>
>> Because they target a specific S/W cache level whereas other cache
>> operations are working with VA.
>>
>> To make it short, the other VA cache instructions will work to Point of
>> Coherency/Point of Unification and guarantee that the caches will be
>> consistent. For more details see B2.2.6 in ARM DDI 0406C.c.
> 
> I skimmed that section, and I'm not much the wiser.
> 
> Just to be clear, this is my question.
> 
> Suppose we have the following sequence of events (where vN[pM] means
> vcpu N running on pcpu M):
> 
> Start with A == 0
> 
> 1. v0[p1] Read A
>    p1 has 'A==0' in the cache
> 2. scheduler migrates v0 to p0
> 3. v0[p0] A=2
>    p0 has 'A==2' in the cache
> 4 scheduler migrates v0 to p1
> 5 v0[p1] Read A
> 
> Now, I presume that with the guest not doing anything, the Read of A at
> #5 will end up as '2'; i.e., behind the scenes somewhere, either by Xen
> or by the hardware, between #1 and #5, p0's version of A gets "cleaned"
> and p1's version of A gets "invalidated" (to use the terminology from
> the section mentioned above).

Caches on Arm are coherent and are controlled by the attributes in the 
page-tables. Imagine the region is Normal Cacheable and Inner Shareable: 
a data synchronization barrier in #4 will ensure the visibility of A 
to p1. So A will be read as 2.
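
Roughly, in arm64 terms (a sketch only, assuming A lives in a Normal, 
Inner Shareable, cacheable page):

    #include <stdint.h>

    volatile uint32_t A;            /* starts at 0, as in your scenario */

    void step3_on_p0(void)
    {
        A = 2;                      /* may sit dirty in p0's cache */
        /* #4, on the context-switch path: complete the write for the
         * whole Inner Shareable domain. */
        asm volatile("dsb ish" ::: "memory");
    }

    uint32_t step5_on_p1(void)
    {
        return A;                   /* the coherent caches return 2 */
    }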

> 
> So my question is, how does *adding* cache flushing of any sort end up
> violating the integrity in a situation like the above?

Because the integrity is based on the memory attributes in the 
page-tables. S/W instructions work directly on the cache and will break 
the coherency. Marc pointed me to his talk [1] that explains caches on 
Arm and also the set/way problem (see slide 8 onwards).
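
For the x86 folks on CC, the two flavours side by side (the arm64 
mnemonics are real; the wrappers are just for illustration):

    #include <stdint.h>

    void flush_line_by_va(uintptr_t va)
    {
        /* Participates in the coherency protocol: broadcast to other
         * cores as required and applied down to the Point of Coherency. */
        asm volatile("dc civac, %0" :: "r" (va) : "memory");
    }

    void flush_line_by_set_way(uint64_t sw)  /* encodes level/set/way */
    {
        /* Acts on one way of one set of one level of the *local* cache
         * only: no broadcast, and a system cache may not implement it
         * at all. */
        asm volatile("dc cisw, %0" :: "r" (sw) : "memory");
    }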

> 
>>>> For those been worry about the performance impact, I have looked at the
>>>> current use of S/W instructions:
>>>>       - Linux Arm64: The last used in the kernel was beginning of 2015
>>>>       - Linux Arm32: Still use S/W for boot and secondary CPU
>>>> bring-up. No
>>>> plan to change.
>>>>       - UEFI: A couple of use in UEFI, but I have heard they plan to
>>>> remove them (need confirmation).
>>>>
>>>> I haven't looked at all the OSes. However, given the Arm Arm clearly
>>>> state S/W instructions are not easily virtualizable, I would expect
>>>> guest OSes developers to try there best to limit the use of the
>>>> instructions.
>>>>
>>>> To limit the performance impact, we could introduce a guest option to
>>>> tell whether the guest will use S/W. If it does plan to use S/W, PoD
>>>> will be disabled.
>>>>
>>>> Now regarding the hardware domain. At the moment, it has its RAM direct
>>>> mapped. Supporting direct mapping in PoD will be quite a pain for a
>>>> limited benefits (see why above). In that case I would suggest to impose
>>>> vCPU pinning for the hardware domain if the S/W are expected to be used.
>>>> Again, a command line option could be introduced here.
>>>>
>>>> Any feedbacks on the approach will be welcomed.
>>>
>>> I still don't entirely understand the underlying failure mode, but there
>>> are a couple of things we could consider:
>>>
>>> 1. Automatically disabling 'vcpu migration' when caching is turned off.
>>> This wouldn't prevent a vcpu from being preempted, just from being run
>>> somewhere else.
>>
>> This suggests the guest will directly perform S/W, right? So you leave
>> the possibility to the guest to flush all caches the vCPU can access.
>> This is an easy way for the guest to affect the cache entries of other guests.
>>
>> I think this could enable some potential data attacks.
> 
> Well, it's the equivalent of your "imposing vcpu pinning" solution
> above, but only temporary.  Was that suggestion meant to allow the
> hardware domain to directly perform S/W?

Yes, for the hardware domain only, because it is more trusted IMHO. I 
thought you meant for every guest. The problem I can see here is that 
you would need to trap cache-toggling. When trapping that, you have to 
take all the virtual memory traps. This means trapping:

Non-secure EL1 using AArch64: SCTLR_EL1, TTBR0_EL1, TTBR1_EL1, TCR_EL1, 
ESR_EL1,
FAR_EL1, AFSR0_EL1, AFSR1_EL1, MAIR_EL1, AMAIR_EL1, CONTEXTIDR_EL1.
Non-secure EL1 using AArch32: SCTLR, TTBR0, TTBR1, TTBCR, TTBCR2, DACR, 
DFSR,
IFSR, DFAR, IFAR, ADFSR, AIFSR, PRRR, NMRR, MAIR0, MAIR1, AMAIR0, AMAIR1,
CONTEXTIDR.

Those registers are accessed very often, so you will have a performance 
impact for the whole life of the guest.

However, looking at Marc's slides, this would not work when booting a 
32-bit hardware domain on ARMv8 because system caches might be present.
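
For completeness, a sketch of what the trap dance would look like 
(HCR_EL2.TVM and SCTLR.C are architectural; the macro spellings follow 
Xen's arm headers, and the helpers are invented):

    /* First trapped S/W instruction: start trapping the VM registers
     * so we can see the caches being toggled, and do a full clean. */
    static void vcpu_handle_sw_cmo(struct vcpu *v)
    {
        WRITE_SYSREG(READ_SYSREG(HCR_EL2) | HCR_TVM, HCR_EL2);
        flush_guest_mapped_ram(v->domain);        /* invented helper */
    }

    /* Trapped write to SCTLR_EL1 (or the AArch32 SCTLR): flush on a
     * cache on/off transition, and stop paying the trap cost once the
     * caches are enabled again. */
    static void vcpu_handle_sctlr_write(struct vcpu *v, register_t val)
    {
        if ( cache_bit_toggled(v, val) )          /* invented helper */
            flush_guest_mapped_ram(v->domain);

        if ( val & SCTLR_C )                      /* caches back on */
            WRITE_SYSREG(READ_SYSREG(HCR_EL2) & ~HCR_TVM, HCR_EL2);

        update_guest_sctlr(v, val);               /* invented helper */
    }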

> 
>>> 2. It sounds like rather than using PoD, you could use the
>>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>>> entry which cause a specific kind of HAP fault when accessed.  The fault
>>> handler then looks in the p2m entry, and if it finds an otherwise valid
>>> entry, it just fixes the "misconfigured" bits and continues.
>>
>> I thought about this. But when do you set the entry to misconfigured?
>>
>> If you take the example of Linux 32-bit, there are a couple of full
>> cache cleans during the boot of a uni-processor. So you would need to go
>> through the p2m multiple times and reset the access bits.
> 
> Do you want to reset the p2m multiple times?  I thought the goal was
> simply to keep the amount of p2m space you need to flush to a minimum;
> if you expect the memory which has been faulted in by the *last* flush
> to be relatively small, you could just always flush all memory that had
> been touched to that point.
> 
> If you *do* need to go through the p2m multiple times, then
> misconfiguration is a much better option than PoD.  In PoD, once a page
> has data on it, it can't be removed from the p2m anymore.  For the
> misconfiguration technique, you can go through and misconfigure the
> entries in the top-level p2m table as many times as you want.  The whole
> reason for doing it on x86 is that it's a relatively lightweight
> operation: we use it to modify MMIO mappings, to enable or disable
> logdirty for migrate, &c.

Does this also work when you share the page-tables with the IOMMU? It 
just occurred to me that for both PoD and "misconfigured bits" we would 
get into trouble because page-tables are shared with the IOMMU.

But I guess it would be acceptable to say "you use S/W instructions in 
your OS, so you have to pay a worse performance price unless you fix 
your OS".

> 
> (This of course depends on being able to effectively misconfigure
> top-level entries of the p2m on ARM.)

More on that in the answer to Jan's e-mail.

Cheers,

[1] 
https://events.linuxfoundation.org/sites/events/files/slides/slides_10.pdf

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07 13:52       ` Julien Grall
@ 2017-12-07 14:25         ` Jan Beulich
  2017-12-07 14:53         ` Marc Zyngier
  1 sibling, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2017-12-07 14:25 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, Marc Zyngier, Andre Przywara, Tim Deegan,
	George Dunlap, George Dunlap, Andrew Cooper, xen-devel

>>> On 07.12.17 at 14:52, <julien.grall@linaro.org> wrote:
> On 06/12/17 17:49, George Dunlap wrote:
>> Do you want to reset the p2m multiple times?  I thought the goal was
>> simply to keep the amount of p2m space you need to flush to a minimum;
>> if you expect the memory which has been faulted in by the *last* flush
>> to be relatively small, you could just always flush all memory that had
>> been touched to that point.
>> 
>> If you *do* need to go through the p2m multiple times, then
>> misconfiguration is a much better option than PoD.  In PoD, once a page
>> has data on it, it can't be removed from the p2m anymore.  For the
>> misconfiguration technique, you can go through and misconfigure the
>> entries in the top-level p2m table as many times as you want.  The whole
>> reason for doing it on x86 is that it's a relatively lightweight
>> operation: we use it to modify MMIO mappings, to enable or disable
>> logdirty for migrate, &c.
> 
> Does this also work when you share the page-tables with the IOMMU? It 
> just occurred to me that for both PoD and "misconfigured bits" we would 
> get into trouble because page-tables are shared with the IOMMU.

PoD and IOMMU are incompatible on x86 at present.

The bits we use for "mis-configuring" entries are ignored by the IOMMU,
which is not a problem since all we use this approach for (right now) is
to update the memory type (i.e. cacheability) for possibly huge ranges.

Jan



* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07 13:52       ` Julien Grall
  2017-12-07 14:25         ` Jan Beulich
@ 2017-12-07 14:53         ` Marc Zyngier
  2017-12-07 15:45           ` Jan Beulich
  1 sibling, 1 reply; 41+ messages in thread
From: Marc Zyngier @ 2017-12-07 14:53 UTC (permalink / raw)
  To: Julien Grall, George Dunlap, xen-devel, Jan Beulich,
	Andrew Cooper, George Dunlap, Stefano Stabellini, Andre Przywara,
	Tim Deegan

On 07/12/17 13:52, Julien Grall wrote:
> (+ Marc)
> 
> Hi,
> 
> @Marc: My Arm cache knowledge is somewhat limited. Feel free to correct 
> me if I am wrong.
> 
> Before answering the rest of the e-mail, let me reinforce what I said 
> in my first e-mail. Set/Way are very complex to emulate and an OS using 
> them should never expect good performance in a virtualization context. 
> The difficulty is clearly spelled out in the Arm Arm.

It is actually even worse than that. Software using set/way operations
is simply not virtualizable, full stop. Yes, we paper over it in ugly
ways, but nobody should really use set/way.

There is exactly one case where set/way makes sense, and that's when
you're the only CPU left in the system, your MMU is off, and you're
about to go down.

> So the main goal here is to work around such software.

Quite. Said SW is usually a 32bit Linux kernel.

> 
> On 06/12/17 17:49, George Dunlap wrote:
>> On 12/06/2017 12:58 PM, Julien Grall wrote:
>>> Hi George,
>>>
>>> On 12/06/2017 12:28 PM, George Dunlap wrote:
>>>> On 12/05/2017 06:39 PM, Julien Grall wrote:
>>>>> Hi all,
>>>>>
>>>>> Even though it is an Arm failure, I have CCed x86 folks to get feedback
>>>>> on the approach. I have a WIP branch I could share if that interest
>>>>> people.
>>>>>
>>>>> Few months ago, we noticed an heisenbug on jobs run by osstest on the
>>>>> cubietrucks (see [1]). From the log, we figured out that the guest vCPU
>>>>> 0 is in data/prefetch abort state at early boot. I have been able to
>>>>> reproduce it reliably, although from the little information I have I
>>>>> think it is related to a cache issue because we don't trap cache
>>>>> maintenance instructions by set/way.
>>>>>
>>>>> This is a set of 3 instructions (clean, clean & invalidate, invalidate)
>>>>> working on a given cache level by S/W. Because the OS is not allowed to
>>>>> infer the S/W to PA mapping, it can only use S/W to nuke the whole
>>>>> cache. "The expected usage of the cache maintenance that operate by
>>>>> set/way is associated with powerdown and powerup of caches, if this is
>>>>> required by the implementation" (see D3-2020 ARM DDI 0487B.b).
>>>>>
>>>>> Those instructions will target a local processor and usually working in
>>>>> batch for nuking the cache. This means if the vCPU is migrated to
>>>>> another pCPU in the middle of the process, the cache may not be cleaned.
>>>>> This would result to data corruption and potential crash of the OS.
>>>>
>>>> I don't quite understand the failure mode here: Why does vCPU migration
>>>> cause cache inconsistency in the middle of one of these "cleans", but
>>>> not under normal operation?
>>>
>>> Because they target a specific S/W cache level whereas other cache
>>> operations are working with VA.
>>>
>>> To make it short, the other VA cache instructions will work to Point of
>>> Coherency/Point of Unification and guarantee that the caches will be
>>> consistent. For more details see B2.2.6 in ARM DDI 0406C.c.
>>
>> I skimmed that section, and I'm not much the wiser.
>>
>> Just to be clear, this is my question.
>>
>> Suppose we have the following sequence of events (where vN[pM] means
>> vcpu N running on pcpu M):
>>
>> Start with A == 0
>>
>> 1. v0[p1] Read A
>>    p1 has 'A==0' in the cache
>> 2. scheduler migrates v0 to p0
>> 3. v0[p0] A=2
>>    p0 has 'A==2' in the cache
>> 4 scheduler migrates v0 to p1
>> 5 v0[p1] Read A
>>
>> Now, I presume that with the guest not doing anything, the Read of A at
>> #5 will end up as '2'; i.e., behind the scenes somewhere, either by Xen
>> or by the hardware, between #1 and #5, p0's version of A gets "cleaned"
>> and p1's version of A gets "invalidated" (to use the terminology from
>> the section mentioned above).
> 
> Caches on Arm are coherent and are controlled by the attributes in the 
> page-tables. Imagine the region is Normal Cacheable and Inner Shareable: 
> a data synchronization barrier in #4 will ensure the visibility of A 
> to p1. So A will be read as 2.
> 
>>
>> So my question is, how does *adding* cache flushing of any sort end up
>> violating the integrity in a situation like the above?
> 
> Because the integrity is based on the memory attributes in the 
> page-tables. S/W instructions work directly on the cache and will break 
> the coherency. Marc pointed me to his talk [1] that explains caches on 
> Arm and also the set/way problem (see slide 8 onwards).

On top of bypassing the coherency, S/W CMOs do not prevent lines from
migrating from one CPU to another. So you could happily be flushing by
S/W, and still end up with dirty lines in your cache. Success!

At that point, performance is the least of your worries.

> 
>>
>>>>> For those been worry about the performance impact, I have looked at the
>>>>> current use of S/W instructions:
>>>>>       - Linux Arm64: The last used in the kernel was beginning of 2015
>>>>>       - Linux Arm32: Still use S/W for boot and secondary CPU
>>>>> bring-up. No
>>>>> plan to change.
>>>>>       - UEFI: A couple of use in UEFI, but I have heard they plan to
>>>>> remove them (need confirmation).
>>>>>
>>>>> I haven't looked at all the OSes. However, given the Arm Arm clearly
>>>>> state S/W instructions are not easily virtualizable, I would expect
>>>>> guest OSes developers to try there best to limit the use of the
>>>>> instructions.
>>>>>
>>>>> To limit the performance impact, we could introduce a guest option to
>>>>> tell whether the guest will use S/W. If it does plan to use S/W, PoD
>>>>> will be disabled.
>>>>>
>>>>> Now regarding the hardware domain. At the moment, it has its RAM direct
>>>>> mapped. Supporting direct mapping in PoD will be quite a pain for a
>>>>> limited benefits (see why above). In that case I would suggest to impose
>>>>> vCPU pinning for the hardware domain if the S/W are expected to be used.
>>>>> Again, a command line option could be introduced here.
>>>>>
>>>>> Any feedbacks on the approach will be welcomed.
>>>>
>>>> I still don't entirely understand the underlying failure mode, but there
>>>> are a couple of things we could consider:
>>>>
>>>> 1. Automatically disabling 'vcpu migration' when caching is turned off.
>>>> This wouldn't prevent a vcpu from being preempted, just from being run
>>>> somewhere else.
>>>
>>> This suggests the guest will directly perform S/W, right? So you leave
>>> the possibility to the guest to flush all caches the vCPU can access.
>>> This is an easy way for the guest to affect the cache entries of other guests.
>>>
>>> I think this could enable some potential data attacks.
>>
>> Well, it's the equivalent of your "imposing vcpu pinning" solution
>> above, but only temporary.  Was that suggestion meant to allow the
>> hardware domain to directly perform S/W?
> 
> Yes, for the hardware domain only, because it is more trusted IMHO. I 
> thought you meant for every guest. The problem I can see here is that 
> you would need to trap cache-toggling. When trapping that, you have to 
> take all the virtual memory traps. This means trapping:
> 
> Non-secure EL1 using AArch64: SCTLR_EL1, TTBR0_EL1, TTBR1_EL1, TCR_EL1, 
> ESR_EL1,
> FAR_EL1, AFSR0_EL1, AFSR1_EL1, MAIR_EL1, AMAIR_EL1, CONTEXTIDR_EL1.
> Non-secure EL1 using AArch32: SCTLR, TTBR0, TTBR1, TTBCR, TTBCR2, DACR, 
> DFSR,
> IFSR, DFAR, IFAR, ADFSR, AIFSR, PRRR, NMRR, MAIR0, MAIR1, AMAIR0, AMAIR1,
> CONTEXTIDR.
> 
> Those registers are accessed very often, so you will have a performance 
> impact for the whole life of the guest.
> 
> However, looking at Marc's slides, this would not work when booting a 
> 32-bit hardware domain on ARMv8 because system caches might be present.

Yes, and this further outlines why using S/W is b0rken. You're not
guaranteed that all your cache hierarchy will implement S/W.

> 
>>
>>>> 2. It sounds like rather than using PoD, you could use the
>>>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>>>> entry which cause a specific kind of HAP fault when accessed.  The fault
>>>> handler then looks in the p2m entry, and if it finds an otherwise valid
>>>> entry, it just fixes the "misconfigured" bits and continues.
>>>
>>> I thought about this. But when do you set the entry to misconfigured?
>>>
>>> If you take the example of Linux 32-bit, there are a couple of full
>>> cache cleans during the boot of a uni-processor. So you would need to go
>>> through the p2m multiple times and reset the access bits.
>>
>> Do you want to reset the p2m multiple times?  I thought the goal was
>> simply to keep the amount of p2m space you need to flush to a minimum;
>> if you expect the memory which has been faulted in by the *last* flush
>> to be relatively small, you could just always flush all memory that had
>> been touched to that point.
>>
>> If you *do* need to go through the p2m multiple times, then
>> misconfiguration is a much better option than PoD.  In PoD, once a page
>> has data on it, it can't be removed from the p2m anymore.  For the
>> misconfiguration technique, you can go through and misconfigure the
>> entries in the top-level p2m table as many times as you want.  The whole
>> reason for doing it on x86 is that it's a relatively lightweight
>> operation: we use it to modify MMIO mappings, to enable or disable
>> logdirty for migrate, &c.
> 
> Does this also work when you share the page-tables with the IOMMU? It 
> just occurred to me that for both PoD and "misconfigured bits" we would 
> get into trouble because page-tables are shared with the IOMMU.
> 
> But I guess it would be acceptable to say "you use S/W instructions in 
> your OS, so you have to pay a worse performance price unless you fix 
> your OS".

I think that's a very valid argument. It is definitely a case of "Don't
do that". Yes, a 32bit Linux kernel will be slow to boot under Xen. If
people care about speed, they will fix it (or boot a non compressed
guest kernel). I think correctness matters a lot more than speed.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07  9:39         ` Jan Beulich
@ 2017-12-07 15:22           ` Julien Grall
  2017-12-07 15:49             ` Jan Beulich
  0 siblings, 1 reply; 41+ messages in thread
From: Julien Grall @ 2017-12-07 15:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	George Dunlap, Marc Zyngier, Andrew Cooper, xen-devel

(+ Marc)

@Marc: My Arm cache knowledge is somewhat limited. Feel free to correct 
me if I am wrong.

On 07/12/17 09:39, Jan Beulich wrote:
>>>> On 06.12.17 at 18:52, <julien.grall@linaro.org> wrote:
>> On 12/06/2017 03:15 PM, Jan Beulich wrote:
>>> What we do in x86 is that we flag all entries at the top level as
>>> misconfigured at any time where otherwise we would have to
>>> walk the full tree. Upon access, the misconfigured flag is being
>>> propagated down the page table hierarchy, with only the
>>> intermediate and leaf entries needed for the current access
>>> becoming properly configured again. In your case, as long as
>>> only a limited set of leaf entries are being touched before any
>>> S/W emulation is needed, you'd be able to skip all misconfigured
>>> entries in your traversal, just like with PoD you'd skip
>>> unpopulated ones.
>>
>> Oh, what you call "misconfigured bits" would be clearing the valid bit
>> of an entry on Arm. The entry would be considered invalid, but it is
>> still possible to store information (the rest of the bits are ignored
>> by the hardware).
> 
> Well, on x86 we don't always have a separate "valid" bit, hence
> we set something else to a value which will cause a suitable VM
> exit when being accessed by the guest.
> 
>> But I think this is bringing another class of problem. When a
>> misconfigured entry is accessed, we would need to clean & invalidate the cache
>> for that region.
> 
> Why? (Please remember that I'm an x86 person, so may simply
> not be aware of extra constraints ARM has.) The data in the
> cache (if any) doesn't change while the mapping is invalid (unless
> Xen modifies it, but if there was a coherency problem between
> Xen and guest accesses, you'd have the issue with hypercalls
> which you describe later independent of the approach suggested
> here).

Caches on Arm are coherent and are controlled by attributes in the 
page-tables. The coherency is lost if you access a region with different 
memory attributes.

To take the hypercall case, we require memory shared with the hypervisor 
or any other guest to have specific memory attributes. So this will 
ensure cache coherency. This applies to:
	- hypercall arguments passed via a pointer to guest memory
	- memory shared via the grant table mechanism
	- memory shared with the hypervisor (shared_info, vcpu_info, grant 
table...).

Now regarding access by a guest. Even though the entry is 
"misconfigured" in the guest page-tables, this same physical address may 
have been mapped in other places (e.g. Xen, guests...). Because of 
speculation, a line could have been pulled into the cache. As we don't 
know the memory attributes used by the guest, we have to clean & 
invalidate that region on a guest access.

Getting back to the hypercall case, I am still trying to figure out if 
we need to clean & invalidate the buffer used when the guest entry is 
"misconfigured". I can't convince myself why this would not be 
necessary. I need to have a more thorough think.

Cheers,

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07 14:53         ` Marc Zyngier
@ 2017-12-07 15:45           ` Jan Beulich
  2017-12-07 16:04             ` Marc Zyngier
  2017-12-07 16:04             ` Julien Grall
  0 siblings, 2 replies; 41+ messages in thread
From: Jan Beulich @ 2017-12-07 15:45 UTC (permalink / raw)
  To: Marc Zyngier, Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	George Dunlap, Andrew Cooper, xen-devel

>>> On 07.12.17 at 15:53, <marc.zyngier@arm.com> wrote:
> On 07/12/17 13:52, Julien Grall wrote:
> There is exactly one case where set/way makes sense, and that's when
> you're the only CPU left in the system, your MMU is off, and you're
> about to go down.

With this and ...

> On top of bypassing the coherency, S/W CMOs do not prevent lines from
> migrating from one CPU to another. So you could happily be flushing by
> S/W, and still end up with dirty lines in your cache. Success!

... this I wonder what value emulating those insns then has in the first
place. Can't you as well simply skip and ignore them, with the same
(bad) result?

Jan



* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07 15:22           ` Julien Grall
@ 2017-12-07 15:49             ` Jan Beulich
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2017-12-07 15:49 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, Marc Zyngier, Andre Przywara, Tim Deegan,
	George Dunlap, George Dunlap, Andrew Cooper, xen-devel

>>> On 07.12.17 at 16:22, <julien.grall@linaro.org> wrote:
> On 07/12/17 09:39, Jan Beulich wrote:
>>>>> On 06.12.17 at 18:52, <julien.grall@linaro.org> wrote:
>>> But I think this is bringing another class of problem. When a
>>> misconfigured entry is accessed, we would need to clean & invalidate the cache
>>> for that region.
>> 
>> Why? (Please remember that I'm an x86 person, so may simply
>> not be aware of extra constraints ARM has.) The data in the
>> cache (if any) doesn't change while the mapping is invalid (unless
>> Xen modifies it, but if there was a coherency problem between
>> Xen and guest accesses, you'd have the issue with hypercalls
>> which you describe later independent of the approach suggested
>> here).
> 
> Caches on Arm are coherent and are controlled by attributes in the 
> page-tables. The coherency is lost if you access a region with different 
> memory attributes.
> 
> To take the hypercall case, we require memory shared with the hypervisor 
> or any other guest to have specific memory attributes. So this will 
> ensure cache coherency. This applies to:
> 	- hypercall arguments passed via a pointer to guest memory
> 	- memory shared via the grant table mechanism
> 	- memory shared with the hypervisor (shared_info, vcpu_info, grant 
> table...).
> 
> Now regarding access by a guest. Even though the entry is 
> "misconfigured" in the guest page-tables, this same physical address may 
> be have been mapped in other places (e.g Xen, guests...).

But that's not an issue specific to the situation here, i.e. multiple
mappings with different memory attributes would always be a
problem. Hence I assume you have code in place to deal with that.
By retaining the entry contents except for the valid bit (or
something else to allow you to gain control upon access) nothing
should really change for the rest of the hypervisor logic, provided
such entries are not explicitly being ignored by any of the involved
logic.

Jan



* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07 15:45           ` Jan Beulich
@ 2017-12-07 16:04             ` Marc Zyngier
  2017-12-07 16:04             ` Julien Grall
  1 sibling, 0 replies; 41+ messages in thread
From: Marc Zyngier @ 2017-12-07 16:04 UTC (permalink / raw)
  To: Jan Beulich, Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	George Dunlap, Andrew Cooper, xen-devel

On 07/12/17 15:45, Jan Beulich wrote:
>>>> On 07.12.17 at 15:53, <marc.zyngier@arm.com> wrote:
>> On 07/12/17 13:52, Julien Grall wrote:
>> There is exactly one case where set/way makes sense, and that's when
>> you're the only CPU left in the system, your MMU is off, and you're
>> about to go down.
> 
> With this and ...
> 
>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>> migrating from one CPU to another. So you could happily be flushing by
>> S/W, and still end up with dirty lines in your cache. Success!
> 
> ... this I wonder what value emulating those insns then has in the first
> place. Can't you as well simply skip and ignore them, with the same
> (bad) result?

Your call. You could perfectly decide not to emulate them and let the
guest shoot itself in the foot. That will make the validation of 32bit
Linux guests pretty simple (they will fail to boot on most platforms).

The choice we made in KVM is to emulate them slowly but safely, by 
converting them into VA CMOs over the full address space. Not pretty, 
and quite invasive. But at least I can boot a 32bit kernel with 
guarantees similar to what it would have had on bare metal without any 
system cache.
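
In Xen terms the hammer would boil down to something like this (a 
sketch only; the iterator is invented, the other helper names are 
approximations of Xen/arm ones, and the real thing needs the preemption 
discussed elsewhere in the thread):

    /* Trap handler for any of the three S/W instructions: the set/way
     * argument is ignored (its PA mapping is unknowable), and instead
     * every page of RAM mapped in stage-2 is cleaned+invalidated by VA. */
    static void emulate_sw_cmo(struct domain *d)
    {
        unsigned long gfn;

        for_each_mapped_gfn ( d, gfn )             /* invented iterator */
            flush_page_to_ram(gfn_to_mfn(d, gfn)); /* names approximate */
    }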

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07 15:45           ` Jan Beulich
  2017-12-07 16:04             ` Marc Zyngier
@ 2017-12-07 16:04             ` Julien Grall
  2017-12-07 16:44               ` George Dunlap
  1 sibling, 1 reply; 41+ messages in thread
From: Julien Grall @ 2017-12-07 16:04 UTC (permalink / raw)
  To: Jan Beulich, Marc Zyngier
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	George Dunlap, Andrew Cooper, xen-devel

Hi Jan,

On 07/12/17 15:45, Jan Beulich wrote:
>>>> On 07.12.17 at 15:53, <marc.zyngier@arm.com> wrote:
>> On 07/12/17 13:52, Julien Grall wrote:
>> There is exactly one case where set/way makes sense, and that's when
>> you're the only CPU left in the system, your MMU is off, and you're
>> about to go down.
> 
> With this and ...
> 
>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>> migrating from one CPU to another. So you could happily be flushing by
>> S/W, and still end up with dirty lines in your cache. Success!
> 
> ... this I wonder what value emulating those insns then has in the first
> place. Can't you as well simply skip and ignore them, with the same
> (bad) result?

The result will be much, much worse. Here is a concrete example with 
Linux Arm 32-bit:

	1) Cache enabled
	2) Decompress
	3) Nuke cache (S/W)
	4) Cache off
	5) Access new kernel

If you skip #3, the decompressed data may not have reached the memory, 
so you would access stale data.

This would effectively mean we don't support Linux Arm 32-bit.

Cheers,

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07 16:04             ` Julien Grall
@ 2017-12-07 16:44               ` George Dunlap
  2017-12-07 16:58                 ` Marc Zyngier
  0 siblings, 1 reply; 41+ messages in thread
From: George Dunlap @ 2017-12-07 16:44 UTC (permalink / raw)
  To: Julien Grall, Jan Beulich, Marc Zyngier
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Andrew Cooper, xen-devel

On 12/07/2017 04:04 PM, Julien Grall wrote:
> Hi Jan,
> 
> On 07/12/17 15:45, Jan Beulich wrote:
>>>>> On 07.12.17 at 15:53, <marc.zyngier@arm.com> wrote:
>>> On 07/12/17 13:52, Julien Grall wrote:
>>> There is exactly one case where set/way makes sense, and that's when
>>> you're the only CPU left in the system, your MMU is off, and you're
>>> about to go down.
>>
>> With this and ...
>>
>>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>>> migrating from one CPU to another. So you could happily be flushing by
>>> S/W, and still end up with dirty lines in your cache. Success!
>>
>> ... this I wonder what value emulating those insns then has in the first
>> place. Can't you as well simply skip and ignore them, with the same
>> (bad) result?
> 
> The result will be much, much worse. Here is a concrete example with Linux
> Arm 32-bit:
> 
>     1) Cache enabled
>     2) Decompress
>     3) Nuke cache (S/W)
>     4) Cache off
>     5) Access new kernel
> 
> If you skip #3, the decompressed data may not have reached the memory, so
> you would access stale data.
> 
> This would effectively mean we don't support Linux Arm 32-bit.

So Marc said that #3 "doesn't make sense", since although it might be
the only cpu on in the system, you're not "about to go down"; but Linux
32-bit is doing that anyway.

It sounds like from the slides the purpose of #3 might be to get stuff
out of the D-cache into the I-cache.  But why is the cache turned off?
And why doesn't Linux use the VA-based flushes rather than the S/W flushes?

 -George



* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07 16:44               ` George Dunlap
@ 2017-12-07 16:58                 ` Marc Zyngier
  2017-12-07 18:06                   ` George Dunlap
  0 siblings, 1 reply; 41+ messages in thread
From: Marc Zyngier @ 2017-12-07 16:58 UTC (permalink / raw)
  To: George Dunlap, Julien Grall, Jan Beulich
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Andrew Cooper, xen-devel

On 07/12/17 16:44, George Dunlap wrote:
> On 12/07/2017 04:04 PM, Julien Grall wrote:
>> Hi Jan,
>>
>> On 07/12/17 15:45, Jan Beulich wrote:
>>>>>> On 07.12.17 at 15:53, <marc.zyngier@arm.com> wrote:
>>>> On 07/12/17 13:52, Julien Grall wrote:
>>>> There is exactly one case where set/way makes sense, and that's when
>>>> you're the only CPU left in the system, your MMU is off, and you're
>>>> about to go down.
>>>
>>> With this and ...
>>>
>>>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>>>> migrating from one CPU to another. So you could happily be flushing by
>>>> S/W, and still end up with dirty lines in your cache. Success!
>>>
>>> ... this I wonder what value emulating those insns then has in the first
>>> place. Can't you as well simply skip and ignore them, with the same
>>> (bad) result?
>>
>> The result will be much, much worse. Here is a concrete example with Linux
>> Arm 32-bit:
>>
>>     1) Cache enabled
>>     2) Decompress
>>     3) Nuke cache (S/W)
>>     4) Cache off
>>     5) Access new kernel
>>
>> If you skip #3, the decompressed data may not have reached the memory, so
>> you would access stale data.
>>
>> This would effectively mean we don't support Linux Arm 32-bit.
> 
> So Marc said that #3 "doesn't make sense", since although it might be
> the only cpu on in the system, you're not "about to go down"; but Linux
> 32-bit is doing that anyway.

"Doesn't make sense" on an ARMv7+ with SMP. That code dates back to
ARMv4, and has been left untouched ever since. "If it ain't broke..."

> It sounds like from the slides the purpose of #3 might be to get stuff
> out of the D-cache into the I-cache.  But why is the cache turned off?

Linux mandates that the kernel is entered with the MMU off. Which has 
the effect of disabling the caches too (VIVT caches and all that jazz).

> And why doesn't Linux use the VA-based flushes rather than the S/W flushes?

Linux/arm64 does. Changing the 32bit port to use VA CMOs would probably
break stuff from the late 90s, so that's not going to happen. These
days, I tend to pick my battles... ;-)

	M.
-- 
Jazz is not dead. It just smells funny...


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07 16:58                 ` Marc Zyngier
@ 2017-12-07 18:06                   ` George Dunlap
  2017-12-07 19:21                     ` Marc Zyngier
  0 siblings, 1 reply; 41+ messages in thread
From: George Dunlap @ 2017-12-07 18:06 UTC (permalink / raw)
  To: Marc Zyngier, Julien Grall, Jan Beulich
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Andrew Cooper, xen-devel

On 12/07/2017 04:58 PM, Marc Zyngier wrote:
> On 07/12/17 16:44, George Dunlap wrote:
>> On 12/07/2017 04:04 PM, Julien Grall wrote:
>>> Hi Jan,
>>>
>>> On 07/12/17 15:45, Jan Beulich wrote:
>>>>>>> On 07.12.17 at 15:53, <marc.zyngier@arm.com> wrote:
>>>>> On 07/12/17 13:52, Julien Grall wrote:
>>>>> There is exactly one case where set/way makes sense, and that's when
>>>>> you're the only CPU left in the system, your MMU is off, and you're
>>>>> about to go down.
>>>>
>>>> With this and ...
>>>>
>>>>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>>>>> migrating from one CPU to another. So you could happily be flushing by
>>>>> S/W, and still end up with dirty lines in your cache. Success!
>>>>
>>>> ... this I wonder what value emulating those insns then has in the first
>>>> place. Can't you as well simply skip and ignore them, with the same
>>>> (bad) result?
>>>
>>> The result will be much, much worse. Here is a concrete example with Linux
>>> Arm 32-bit:
>>>
>>>     1) Cache enabled
>>>     2) Decompress
>>>     3) Nuke cache (S/W)
>>>     4) Cache off
>>>     5) Access new kernel
>>>
>>> If you skip #3, the decompressed data may not have reached the memory, so
>>> you would access stale data.
>>>
>>> This would effectively mean we don't support Linux Arm 32-bit.
>>
>> So Marc said that #3 "doesn't make sense", since although it might be
>> the only cpu on in the system, you're not "about to go down"; but Linux
>> 32-bit is doing that anyway.
> 
> "Doesn't make sense" on an ARMv7+ with SMP. That code dates back to
> ARMv4, and has been left untouched ever since. "If it ain't broke..."
> 
>> It sounds like from the slides the purpose of #3 might be to get stuff
>> out of the D-cache into the I-cache.  But why is the cache turned off?
> 
> Linux mandates that the kernel is entered with the MMU off. Which has
> the effect of disabling the caches too (VIVT caches and all that jazz).
> 
>> And why doesn't Linux use the VA-based flushes rather than the S/W flushes?
> 
> Linux/arm64 does. Changing the 32bit port to use VA CMOs would probably
> break stuff from the late 90s, so that's not going to happen. These
> days, I tend to pick my battles... ;-)

OK, so let me try to state this "forwards" for those of us not familiar
with the situation:

1. Linux expects to start in 'linear' mode, with the MMU disabled.

2. On ARM, disabling the MMU disables caching (!).  But disabling
caching doesn't flush the cache; it just means the cache is bypassed (!).

3. Which means for Linux on ARM, after unzipping the kernel image, you
need to flush the cache before disabling the MMU and starting Linux proper

4. For historical reasons, 32-bit ARM Linux uses the S/W instructions to
flush the cache.  This still works on 32-bit hardware, and so the Linux
maintainers are loath to change it, even though more reliable VA-based
instructions are available (?).

5. For 64-bit hardware, the S/W instructions don't affect the L3 cache
[1] (?!).  So for a 32-bit guest on a 64-bit host, the above is entirely broken.

6. Rather than fix this in Linux, KVM has added a work-around in which
the *hypervisor* flushes the caches at certain points (!!!).  Julien is
looking into doing the same with Xen.

Is that about right?

Given the variety of hardware that Linux has to run on, it's hard to
understand why 1) 32-bit ARM Linux couldn't detect whether it would be
appropriate to use VA-based instructions rather than S/W instructions,
and 2) there couldn't at least be a Kconfig option to use VA
instructions instead of S/W instructions.

 -George

[1]
https://events.linuxfoundation.org/sites/events/files/slides/slides_10.pdf,
slide 9


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07 18:06                   ` George Dunlap
@ 2017-12-07 19:21                     ` Marc Zyngier
  2017-12-08 10:56                       ` George Dunlap
  0 siblings, 1 reply; 41+ messages in thread
From: Marc Zyngier @ 2017-12-07 19:21 UTC (permalink / raw)
  To: George Dunlap, Julien Grall, Jan Beulich
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Andrew Cooper, xen-devel

On 07/12/17 18:06, George Dunlap wrote:
> On 12/07/2017 04:58 PM, Marc Zyngier wrote:
>> On 07/12/17 16:44, George Dunlap wrote:
>>> On 12/07/2017 04:04 PM, Julien Grall wrote:
>>>> Hi Jan,
>>>>
>>>> On 07/12/17 15:45, Jan Beulich wrote:
>>>>>>>> On 07.12.17 at 15:53, <marc.zyngier@arm.com> wrote:
>>>>>> On 07/12/17 13:52, Julien Grall wrote:
>>>>>> There is exactly one case where set/way makes sense, and that's when
>>>>>> you're the only CPU left in the system, your MMU is off, and you're
>>>>>> about to go down.
>>>>>
>>>>> With this and ...
>>>>>
>>>>>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>>>>>> migrating from one CPU to another. So you could happily be flushing by
>>>>>> S/W, and still end up with dirty lines in your cache. Success!
>>>>>
>>>>> ... this I wonder what value emulating those insns then has in the first
>>>>> place. Can't you as well simply skip and ignore them, with the same
>>>>> (bad) result?
>>>>
>>>> The result will be much, much worse. Here is a concrete example with Linux
>>>> Arm 32-bit:
>>>>
>>>>     1) Cache enabled
>>>>     2) Decompress
>>>>     3) Nuke cache (S/W)
>>>>     4) Cache off
>>>>     5) Access new kernel
>>>>
>>>> If you skip #3, the decompressed data may not have reached the memory, so
>>>> you would access stale data.
>>>>
>>>> This would effectively mean we don't support Linux Arm 32-bit.
>>>
>>> So Marc said that #3 "doesn't make sense", since although it might be
>>> the only cpu on in the system, you're not "about to go down"; but Linux
>>> 32-bit is doing that anyway.
>>
>> "Doesn't make sense" on an ARMv7+ with SMP. That code dates back to
>> ARMv4, and has been left untouched ever since. "If it ain't broke..."
>>
>>> It sounds like from the slides the purpose of #3 might be to get stuff
>>> out of the D-cache into the I-cache.  But why is the cache turned off?
>>
>> Linux mandates that the kernel is entered with the MMU off. Which has
>> the effect of disabling the caches too (VIVT caches and all that jazz).
>>
>>> And why doesn't Linux use the VA-based flushes rather than the S/W flushes?
>>
>> Linux/arm64 does. Changing the 32bit port to use VA CMOs would probably
>> break stuff from the late 90s, so that's not going to happen. These
>> days, I tend to pick my battles... ;-)
> 
> OK, so let me try to state this "forwards" for those of us not familiar
> with the situation:
> 
> 1. Linux expects to start in 'linear' mode, with the MMU disabled.
> 
> 2. On ARM, disabling the MMU disables caching (!).  But disabling
> caching doesn't flush the cache; it just means the cache is bypassed (!).
> 
> 3. Which means for Linux on ARM, after unzipping the kernel image, you
> need to flush the cache before disabling the MMU and starting Linux proper
> 
> 4. For historical reasons, 32-bit ARM Linux uses the S/W instructions to
> flush the cache.  This still works on 32-bit hardware, and so the Linux
>> maintainers are loath to change it, even though more reliable VA-based
> instructions are available (?).

It also works on 64bit HW. It is just not easily virtualizable, which is
why we've removed all S/W from the 64bit Linux port a while ago.

> 
> 5. For 64-bit hardware, the S/W instructions don't affect the L3 cache
> [1] (?!).  So for a 32-bit guest on a 64-bit host, the above is entirely broken.

System caches in general can avoid implementing S/W. That's not specific
to 64bit. It is just that in general, 32bit systems do not have a very
deep cache hierarchy (there are of course a number of exceptions to this
rule). 64bit systems, on the other hand, can be much bigger and are
quite happily stacking a deep cache hierarchy.

> 6. Rather than fix this in Linux, KVM has added a work-around in which
> the *hypervisor* flushes the caches at certain points (!!!).  Julien is
> looking into doing the same with Xen.

The "at certain points" doesn't quite describe it. We fully emulate S/W
instruction using the biggest hammer we can find.

> Is that about right?

I think you got the gist of it.

> Given the variety of hardware that Linux has to run on, it's hard to
> understand why 1) 32-bit ARM Linux couldn't detect if it would be
> appropriate to use VA-based instructions rather than S/W instructions 2)
> There couldn't at least be a Kconfig option to use VA instructions
> instead of S/W instructions.

[Linux hat on]

1) There is hardly anything to detect. Both sets of CMOs are available
on a moderately recent implementation. What you'd want to detect is that
the kernel is "virtualizable", which is not an easy task.

2) Kconfig options are the way to hell. It took us 5 years to get a
32bit kernel that would boot on about anything, and we're not going to
go back.

An alternative option would be to switch to VA CMOs if compiled for
ARMv7 (and maybe v6), assuming that doesn't have any horrible side
effect with broken cache implementations (and there are a few out there).
You'll have to check that this doesn't regress on any existing HW.

Of course, none of that will solve the most important issue, which is to
boot an unmodified kernel from yesterday to install a distribution. If
you want to be able to do that, you'll have to use the aforementioned
hammer.

In the end, it really depends how much you care about 32bit Linux guests
(on both 32 and 64bit Xen), and what your user base expects as a level
of support.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-06 12:58   ` Julien Grall
                       ` (2 preceding siblings ...)
  2017-12-06 17:49     ` George Dunlap
@ 2017-12-08  8:03     ` Tim Deegan
  2017-12-08 14:38       ` Julien Grall
  3 siblings, 1 reply; 41+ messages in thread
From: Tim Deegan @ 2017-12-08  8:03 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andrew Cooper, George Dunlap,
	Jan Beulich, Andre Przywara, xen-devel

Hi,

At 12:58 +0000 on 06 Dec (1512565090), Julien Grall wrote:
> On 12/06/2017 12:28 PM, George Dunlap wrote:
> > 2. It sounds like rather than using PoD, you could use the
> > "misconfigured p2m table" technique that x86 uses: set bits in the p2m
> > entry which cause a specific kind of HAP fault when accessed.  The fault
> > handler then looks in the p2m entry, and if it finds an otherwise valid
> > entry, it just fixes the "misconfigured" bits and continues.
> 
> I thought about this. But when do you set the entry to misconfigured?
> 
> If you take the example of Linux 32-bit, there are a couple of full 
> cache cleans during the boot of a uni-processor. So you would need to go 
> through the p2m multiple times and reset the access bits.

My 2c (echoing what some others have already said):

+1 for avoiding the full majesty of PoD if you don't need it.

It should be possible to do something like the misconfigured-entry bit
trick by _allocating_ the memory up-front and building the p2m entries
but only making them usable by the {IO}MMUs on first access.  That
would make these early p2m walks shorter (because they can skip whole
subtrees that aren't marked present yet) without making major changes
to domain build or introducing run-time failures.

Also beware of DoS conditions -- a guest that touches all its memory
and then flushes by set/way mustn't be allowed to hurt the rest of the
system.  That probably means the set/way flush has to be preemptable.
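A sketch of what a preemptable flush could look like, with a stored
continuation point (the domain field and the two lookup helpers are
invented; hypercall_preempt_check() and -ERESTART are the existing Xen
idiom, modulo the fact that a trapped S/W instruction is not a
hypercall):

    static int flush_guest_ram_preemptible(struct domain *d)
    {
        unsigned long gfn = d->arch.sw_flush_next;   /* invented field */

        for ( ; gfn < domain_get_maximum_gpfn(d); gfn++ )
        {
            if ( gfn_is_mapped(d, gfn) )             /* invented helper */
                flush_page_to_ram(gfn_to_mfn(d, gfn));

            if ( hypercall_preempt_check() )
            {
                d->arch.sw_flush_next = gfn + 1;     /* resume point */
                return -ERESTART;
            }
        }

        d->arch.sw_flush_next = 0;                   /* all done */
        return 0;
    }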

Tim.


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-07 19:21                     ` Marc Zyngier
@ 2017-12-08 10:56                       ` George Dunlap
  2017-12-11 11:10                         ` Andre Przywara
  0 siblings, 1 reply; 41+ messages in thread
From: George Dunlap @ 2017-12-08 10:56 UTC (permalink / raw)
  To: Marc Zyngier, Julien Grall, Jan Beulich
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	Andrew Cooper, xen-devel

On 12/07/2017 07:21 PM, Marc Zyngier wrote:
> On 07/12/17 18:06, George Dunlap wrote:
>> On 12/07/2017 04:58 PM, Marc Zyngier wrote:
>>> On 07/12/17 16:44, George Dunlap wrote:
>>>> On 12/07/2017 04:04 PM, Julien Grall wrote:
>>>>> Hi Jan,
>>>>>
>>>>> On 07/12/17 15:45, Jan Beulich wrote:
>>>>>>>>> On 07.12.17 at 15:53, <marc.zyngier@arm.com> wrote:
>>>>>>> On 07/12/17 13:52, Julien Grall wrote:
>>>>>>> There is exactly one case where set/way makes sense, and that's when
>>>>>>> you're the only CPU left in the system, your MMU is off, and you're
>>>>>>> about to go down.
>>>>>>
>>>>>> With this and ...
>>>>>>
>>>>>>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>>>>>>> migrating from one CPU to another. So you could happily be flushing by
>>>>>>> S/W, and still end up with dirty lines in your cache. Success!
>>>>>>
>>>>>> ... this I wonder what value emulating those insns then has in the first
>>>>>> place. Can't you as well simply skip and ignore them, with the same
>>>>>> (bad) result?
>>>>>
>>>>> The result will be much, much worse. Here is a concrete example with Linux
>>>>> Arm 32-bit:
>>>>>
>>>>>     1) Cache enabled
>>>>>     2) Decompress
>>>>>     3) Nuke cache (S/W)
>>>>>     4) Cache off
>>>>>     5) Access new kernel
>>>>>
>>>>> If you skip #3, the decompressed data may not have reached the memory, so
>>>>> you would access stale data.
>>>>>
>>>>> This would effectively mean we don't support Linux Arm 32-bit.
>>>>
>>>> So Marc said that #3 "doesn't make sense", since although it might be
>>>> the only cpu on in the system, you're not "about to go down"; but Linux
>>>> 32-bit is doing that anyway.
>>>
>>> "Doesn't make sense" on an ARMv7+ with SMP. That code dates back to
>>> ARMv4, and has been left untouched ever since. "If it ain't broke..."
>>>
>>>> It sounds like from the slides the purpose of #3 might be to get stuff
>>>> out of the D-cache into the I-cache.  But why is the cache turned off?
>>>
>>> Linux mandates that the kernel is entered with the MMU off. Which has
>>> the effect of disabling the caches too (VIVT caches and all that jazz).
>>>
>>>> And why doesn't Linux use the VA-based flushes rather than the S/W flushes?
>>>
>>> Linux/arm64 does. Changing the 32bit port to use VA CMOs would probably
>>> break stuff from the late 90s, so that's not going to happen. These
>>> days, I tend to pick my battles... ;-)
>>
>> OK, so let me try to state this "forwards" for those of us not familiar
>> with the situation:
>>
>> 1. Linux expects to start in 'linear' mode, with the MMU disabled.
>>
>> 2. On ARM, disabling the MMU disables caching (!).  But disabling
>> caching doesn't flush the cache; it just means the cache is bypassed (!).
>>
>> 3. Which means for Linux on ARM, after unzipping the kernel image, you
>> need to flush the cache before disabling the MMU and starting Linux proper
>>
>> 4. For historical reasons, 32-bit ARM Linux uses the S/W instructions to
>> flush the cache.  This still works on 32-bit hardware, and so the Linux
>>> maintainers are loath to change it, even though more reliable VA-based
>> instructions are available (?).
> 
> It also works on 64bit HW. It is just not easily virtualizable, which is
> why we've removed all S/W from the 64bit Linux port a while ago.

From the diagram in your talk, it looked like the "flush the cache"
operation *doesn't* work anywhere that has a "system cache", even on
bare metal.

>> 6. Rather than fix this in Linux, KVM has added a work-around in which
>> the *hypervisor* flushes the caches at certain points (!!!).  Julien is
>> looking into doing the same with Xen.
> 
> The "at certain points" doesn't quite describe it. We fully emulate S/W
> instructions using the biggest hammer we can find.

Oh, I thought Julien was saying something about flushing the guest's RAM
every time caching was enabled or disabled.

>> Given the variety of hardware that Linux has to run on, it's hard to
>> understand why 1) 32-bit ARM Linux couldn't detect if it would be
>> appropriate to use VA-based instructions rather than S/W instructions 2)
>> There couldn't at least be a Kconfig option to use VA instructions
>> instead of S/W instructions.
> 
> [Linux hat on]
> 
> 1) There is hardly anything to detect. Both sets of CMOs are available
> on a moderately recent implementation. What you'd want to detect is that
> the kernel is "virtualizable", which is not an easy task.
<snip>
> An alternative option would be to switch to VA CMOs if compiled for
> ARMv7 (and maybe v6), assuming that doesn't have any horrible side
> effect with broken cache implementations (and there are a few out there).
> You'll have to check that this doesn't regress on any existing HW.

So the idea would be to use the VA-based operations if available, and
then special-case specific chipsets known to have issues.  Linux (and
Xen and...) end up doing this for lots of different kinds of hardware;
this would be no different.

> 2) Kconfig options are the way to hell. It took us 5 years to get a
> 32bit kernel that would boot on about anything, and we're not going to
> go back.

Well, at the moment you *don't* have a 32-bit kernel that will boot on
anything.  It won't boot (it sounds like) on any 32-bit system that has
a system cache, including a 64-bit hypervisor providing a 32-bit guest.

Alternately, would it make sense to have a PV "cache flush" operation
for hypervisors?  x86 has a way to expose hypervisor capabilities via
specific CPUID leaves.  Does anything like this exist for ARM?  If so,
the code could be, "If virtualized and hypervisor provides PV cache
flush, use that.  Otherwise, fall back to S/W operation."
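
Purely as a sketch of that fallback logic -- every name below is
invented, since no such PV interface exists today:

/* Hypothetical feature bit and wrappers, for illustration only. */
#define HV_FEAT_PV_CACHE_FLUSH  (1u << 0)

extern unsigned int hv_get_features(void);  /* e.g. probed via DT/SMCCC */
extern void hv_pv_flush_all_caches(void);   /* hypothetical hypercall */
extern void flush_cache_by_set_way(void);   /* the existing 32-bit path */

static void flush_all_caches(void)
{
    if (hv_get_features() & HV_FEAT_PV_CACHE_FLUSH)
        hv_pv_flush_all_caches();  /* the hypervisor does the work */
    else
        flush_cache_by_set_way();  /* bare metal / older hypervisor */
}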

> Of course, none of that will solve the most important issue, which is to
> boot an unmodified kernel from yesterday to install a distribution. If
> you want to be able to do that, you'll have to use the aforementioned
> hammer.

Well, it will take time to code up a solution and get *that* into users'
hands as well.  I would think the fastest way to get *most* distros
working would be to open a ticket saying it's broken on virtual
hardware, and asking them to apply a patch.  Then prioritize getting
more "enterprisey" distros working if and when needed.

Just to be clear -- I'm just trying to help push to explore other
options here.  I'm not opposed to Julien or someone making a work-around
in Xen.  But it's quite a bit of effort to achieve a pretty crappy end,
so I think it's worth exploring what kind of effort we could spend
achieving a "proper" fix first.

(Thanks also for taking the time to help explain this.)

 -George


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-08  8:03     ` Tim Deegan
@ 2017-12-08 14:38       ` Julien Grall
  2017-12-10 15:22         ` Tim Deegan
  2017-12-11 10:06         ` Jan Beulich
  0 siblings, 2 replies; 41+ messages in thread
From: Julien Grall @ 2017-12-08 14:38 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Stefano Stabellini, George Dunlap, Andrew Cooper, George Dunlap,
	Jan Beulich, Andre Przywara, xen-devel

On 08/12/17 08:03, Tim Deegan wrote:
> Hi,

Hi Tim,

Somehow your e-mail was marked as spam by Gmail.

> At 12:58 +0000 on 06 Dec (1512565090), Julien Grall wrote:
>> On 12/06/2017 12:28 PM, George Dunlap wrote:
>>> 2. It sounds like rather than using PoD, you could use the
>>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>>> entry which cause a specific kind of HAP fault when accessed.  The fault
>>> handler then looks in the p2m entry, and if it finds an otherwise valid
>>> entry, it just fixes the "misconfigured" bits and continues.
>>
>> I thought about this. But when do you set the entry to misconfigured?
>>
>> Take the example of 32-bit Linux: there are a couple of full cache
>> cleans during a uniprocessor boot. So you would need to go through
>> the p2m multiple times and reset the access bits.
> 
> My 2c (echoing what some others have already said):
> 
> +1 for avoiding the full majesty of PoD if you don't need it.
> 
> It should be possible to do something like the misconfigured-entry bit
> trick by _allocating_ the memory up-front and building the p2m entries
> but only making them usable by the {IO}MMUs on first access.  That
> would make these early p2m walks shorter (because they can skip whole
> subtrees that aren't marked present yet) without making major changes
> to domain build or introducing run-time failures.

I am not aware of any way on Arm to misconfigure an entry. We do have 
valid and access bits, although they will affect the IOMMU as well. So 
it will not be possible to get page-table sharing with this "feature" 
enabled.

At the moment, I am thinking of providing a per-guest option to turn 
on/off the use of the valid/access bits. That would come at the expense 
of doing a full invalidate on S/W.
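
As a very rough sketch of that idea (a simplified descriptor and
helpers, not the actual Xen p2m code):

typedef uint64_t lpae_t;           /* simplified stage-2 descriptor */
#define S2_VALID  (1ULL << 0)      /* architectural valid bit */

/*
 * Sketch only: the entry was built at domain creation but left
 * invalid. On the first stage-2 fault we make it live and retry --
 * unlike PoD, no memory is allocated here.
 */
static bool handle_s2_fault(lpae_t *entry)
{
    if ( *entry == 0 )
        return false;              /* truly unmapped: a real fault */

    if ( !(*entry & S2_VALID) )
    {
        *entry |= S2_VALID;        /* first access: mark the entry valid */
        /* TLB maintenance for this IPA would also be needed here. */
        return true;               /* retry the faulting access */
    }

    return false;
}

A trapped S/W op would then only need to clean the entries that have
been made valid so far.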

> Also beware of DoS conditions -- a guest that touches all its memory
> and then flushes by set/way mustn't be allowed to hurt the rest of the
> system.  That probably means the set/way flush has to be preemptable.

I am fully aware of it :). It was actually mentioned in my first 
e-mail.

Cheers,

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-08 14:38       ` Julien Grall
@ 2017-12-10 15:22         ` Tim Deegan
  2017-12-11 19:50           ` Julien Grall
  2017-12-11 10:06         ` Jan Beulich
  1 sibling, 1 reply; 41+ messages in thread
From: Tim Deegan @ 2017-12-10 15:22 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andrew Cooper, George Dunlap,
	Jan Beulich, Andre Przywara, xen-devel

At 14:38 +0000 on 08 Dec (1512743913), Julien Grall wrote:
> On 08/12/17 08:03, Tim Deegan wrote:
> > +1 for avoiding the full majesty of PoD if you don't need it.
> > 
> > It should be possible to do something like the misconfigured-entry bit
> > trick by _allocating_ the memory up-front and building the p2m entries
> > but only making them usable by the {IO}MMUs on first access.  That
> > would make these early p2m walks shorter (because they can skip whole
> > subtrees that aren't marked present yet) without making major changes
> > to domain build or introducing run-time failures.
> 
> I am not aware of any way on Arm to misconfigure an entry. We do have 
> valid and access bits, although they will affect the IOMMU as well. So 
> it will not be possible to get page-table sharing with this "feature" 
> enabled.

How unfortunate.  How does KVM's demand-population scheme handle the IOMMU? 

Tim.


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-08 14:38       ` Julien Grall
  2017-12-10 15:22         ` Tim Deegan
@ 2017-12-11 10:06         ` Jan Beulich
  2017-12-11 11:11           ` Andrew Cooper
  2017-12-11 20:26           ` Julien Grall
  1 sibling, 2 replies; 41+ messages in thread
From: Jan Beulich @ 2017-12-11 10:06 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	George Dunlap, Andrew Cooper, xen-devel

>>> On 08.12.17 at 15:38, <julien.grall@linaro.org> wrote:
> On 08/12/17 08:03, Tim Deegan wrote:
>> It should be possible to do something like the misconfigured-entry bit
>> trick by _allocating_ the memory up-front and building the p2m entries
>> but only making them usable by the {IO}MMUs on first access.  That
>> would make these early p2m walks shorter (because they can skip whole
>> subtrees that aren't marked present yet) without making major changes
>> to domain build or introducing run-time failures.
> 
> I am not aware of any way on Arm to misconfigure an entry. We do have 
> valid and access bits, although they will affect the IOMMU as well. So 
> it will not be possible to get page-table sharing with this "feature" 
> enabled.

How would you intend to solve the IOMMU part of the problem with
PoD? As was pointed out before - IOMMU and PoD are incompatible
on x86.

Jan



* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-08 10:56                       ` George Dunlap
@ 2017-12-11 11:10                         ` Andre Przywara
  2017-12-11 12:15                           ` George Dunlap
  2017-12-11 21:11                           ` Julien Grall
  0 siblings, 2 replies; 41+ messages in thread
From: Andre Przywara @ 2017-12-11 11:10 UTC (permalink / raw)
  To: George Dunlap, Marc Zyngier, Julien Grall, Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Stefano Stabellini, Tim Deegan, xen-devel

Hi,

On 08/12/17 10:56, George Dunlap wrote:
> On 12/07/2017 07:21 PM, Marc Zyngier wrote:
>> On 07/12/17 18:06, George Dunlap wrote:
>>> On 12/07/2017 04:58 PM, Marc Zyngier wrote:
>>>> On 07/12/17 16:44, George Dunlap wrote:
>>>>> On 12/07/2017 04:04 PM, Julien Grall wrote:
>>>>>> Hi Jan,
>>>>>>
>>>>>> On 07/12/17 15:45, Jan Beulich wrote:
>>>>>>>>>> On 07.12.17 at 15:53, <marc.zyngier@arm.com> wrote:
>>>>>>>> On 07/12/17 13:52, Julien Grall wrote:
>>>>>>>> There is exactly one case where set/way makes sense, and that's when
>>>>>>>> you're the only CPU left in the system, your MMU is off, and you're
>>>>>>>> about to go down.
>>>>>>>
>>>>>>> With this and ...
>>>>>>>
>>>>>>>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>>>>>>>> migrating from one CPU to another. So you could happily be flushing by
>>>>>>>> S/W, and still end up with dirty lines in your cache. Success!
>>>>>>>
>>>>>>> ... this I wonder what value emulating those insns then has in the first
>>>>>>> place. Can't you as well simply skip and ignore them, with the same
>>>>>>> (bad) result?
>>>>>>
>>>>>> The result will be much, much worse. Here is a concrete example with
>>>>>> 32-bit Arm Linux:
>>>>>>
>>>>>>     1) Cache enabled
>>>>>>     2) Decompress
>>>>>>     3) Nuke cache (S/W)
>>>>>>     4) Cache off
>>>>>>     5) Access new kernel
>>>>>>
>>>>>> If you skip #3, the decompressed data may not have reached memory, so
>>>>>> you would access stale data.
>>>>>>
>>>>>> This would effectively mean we don't support Linux Arm 32-bit.
>>>>>
>>>>> So Marc said that #3 "doesn't make sense", since although it might be
>>>>> the only CPU still on in the system, you're not "about to go down"; but Linux
>>>>> 32-bit is doing that anyway.
>>>>
>>>> "Doesn't make sense" on an ARMv7+ with SMP. That code dates back to
>>>> ARMv4, and has been left untouched ever since. "If it ain't broke..."
>>>>
>>>>> It sounds like from the slides the purpose of #3 might be to get stuff
>>>>> out of the D-cache into the I-cache.  But why is the cache turned off?
>>>>
>>>> Linux mandates that the kernel is entered with the MMU off. Which has
>>>> the effect of disabling the caches too (VIVT caches and all that jazz).
>>>>
>>>>> And why doesn't Linux use the VA-based flushes rather than the S/W flushes?
>>>>
>>>> Linux/arm64 does. Changing the 32bit port to use VA CMOs would probably
>>>> break stuff from the late 90s, so that's not going to happen. These
>>>> days, I tend to pick my battles... ;-)
>>>
>>> OK, so let me try to state this "forwards" for those of us not familiar
>>> with the situation:
>>>
>>> 1. Linux expects to start in 'linear' mode, with the MMU disabled.
>>>
>>> 2. On ARM, disabling the MMU disables caching (!).  But disabling
>>> caching doesn't flush the cache; it just means the cache is bypassed (!).
>>>
>>> 3. Which means for Linux on ARM, after unzipping the kernel image, you
>>> need to flush the cache before disabling the MMU and starting Linux proper.
>>>
>>> 4. For historical reasons, 32-bit ARM Linux uses the S/W instructions to
>>> flush the cache.  This still works on 32-bit hardware, and so the Linux
>>> maintainers are loath to change it, even though more reliable VA-based
>>> instructions are available (?).
>>
>> It also works on 64bit HW. It is just not easily virtualizable, which is
>> why we've removed all S/W from the 64bit Linux port a while ago.
> 
> From the diagram in your talk, it looked like the "flush the cache"
> operation *doesn't* work anywhere that has a "system cache", even on
> bare metal.

What Marc probably meant is that they still work *within the
architectural limits* that s/w operations provide:
- S/W CMOs are not broadcasted, so in a live SMP system they are
probably not doing what you expect them to do. This isn't an issue for a
32-bit Linux kernel decompressor, because it is still UP at this point.
- S/W CMOs are optional to implement for system caches. As Marc
mentioned, there are not many 32-bit systems with a system cache out
there. And on those systems you can still boot an uncompressed kernel or
use a gzip-ed kernel and let the bootloader (grub, U-Boot) decompress it.
On the other hand there seem to be a substantial number of (older)
32-bit systems where VA CMOs have issues.

The problem now is that for the "32-bit kernel on a 64-bit hypervisor"
case those two assumptions are not true: the system has multiple CPUs
running already, also 64-bit hardware is much more likely to have system
caches.
So this is mostly a virtualization problem and thus should be solved here.

To help assess the benefits of adding PoD to Xen:
I did some tracing on Friday with a 32-bit kernel on a (64-bit) Juno
with KVM. I see *four* full cache cleans very early on each boot (first
s/w op + caches turned on, twice), plus one cache clean when each (v)CPU
is brought online (due to the initial "turn MMU and cache on" operation).
During the runtime of the kernel there are no s/w ops, except for (v)CPU
off/on-lining (echo [01] > /sys/devices/system/cpu/cpu<n>/online).
I believe these are bogus, as I see the caches still being on, but
that's how it is. Also this is probably not performance critical due to
the nature of this operation.

Having PoD at this point would be quite helpful, as very early at boot
we don't expect much memory to be in use yet, so the "full VA space
cache clean" doesn't have much to do. As a result, a 32-bit kernel boot
in KVM is not noticeably slower than a 64-bit kernel boot.

But on the other hand we had PoD naturally already in KVM, so this came
at no cost.
So I believe it would be worth investigating what the actual impact is
of booting a 32-bit kernel while emulating s/w ops like KVM does (see
below), but cleaning the *whole VA space*. If this is somewhat
acceptable (I assume we have no more than 2GB for a typical ARM32
guest), it might be worth ignoring PoD, at least for now, to solve
this problem (and the IOMMU consequences).

This assumes that a single "full VA flush" cannot be abused as a DoS by
a malicious guest, which should be investigated independently (as this
applies to a PoD implementation as well).
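
For reference, a preemptible whole-guest clean could be structured
roughly as below. Every helper here is a placeholder for whatever the
real p2m iterator would look like; the map/unmap steps are where
Xen/arm32 pays extra, since domheap memory is not permanently mapped:

#define FLUSH_BATCH 64

/* Sketch only, not Xen code. */
static int clean_guest_ram(struct domain *d, unsigned long *gfn)
{
    unsigned long g = *gfn;        /* resume point across preemptions */
    unsigned int count = 0;

    for ( ; p2m_next_mapped(d, &g); g++ )
    {
        void *va = map_domain_page(gfn_to_mfn(d, g));

        clean_dcache_va_range(va, PAGE_SIZE);  /* VA-based CMO, not S/W */
        unmap_domain_page(va);

        if ( ++count == FLUSH_BATCH )
        {
            count = 0;
            if ( hypercall_preempt_check() )   /* bound the DoS window */
            {
                *gfn = g + 1;                  /* restart here next time */
                return -ERESTART;
            }
        }
    }

    return 0;
}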



Somewhat optional read for the background of how KVM optimized this ([1]):

KVM's solution to this problem works under the assumption that s/w
operations with the caches (and MMU) on are not really meaningful, so we
don't bother emulating them to the letter. Also we assume that the
purpose of s/w CMOs is to clean the whole cache. So KVM does two things
to avoid too much work:
- The first trapped s/w op flushes the whole guest VA space. It then
turns "VM op" traps on, to detect when the caches get turned on.
This basically does the work ("flush my whole cache") already on the
first s/w op. Further trapped s/w ops are treated as NOPs then.
- When a trapped VM op signals that the caches are turned on again, we
also clean the whole cache. We then turn VM op trapping *off* again. The
next trapped s/w op would turn it back on.

Those two features are pretty straightforward to implement, avoid
actual s/w operations most of the time (all but the first s/w op are
emulated as NOPs), and still keep things safe within the architectural
limits. Plus, this code is not normally triggered during the actual
kernel runtime, but only at early boot (decompressor plus SMP bring-up).
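
Condensed into pseudocode, that is just a two-state machine. This is a
paraphrase of the behaviour described above, not KVM's actual code --
see [1] for that:

struct vm {
    bool trapping_vm_ops;    /* are SCTLR writes currently trapped? */
};

/* Trapped S/W cache op (HCR.TSW set). */
static void handle_sw_cmo(struct vm *vm)
{
    if (!vm->trapping_vm_ops) {
        flush_whole_guest_ram(vm);   /* the big hammer, once */
        vm->trapping_vm_ops = true;  /* now watch for caches on/off */
    }
    /* Otherwise: NOP -- the earlier flush already did the work. */
}

/* Trapped system-register write (HCR.TVM set). */
static void handle_sctlr_write(struct vm *vm, bool caches_now_on)
{
    if (caches_now_on) {
        flush_whole_guest_ram(vm);    /* clean once more, then relax */
        vm->trapping_vm_ops = false;  /* the next S/W op re-arms the trap */
    }
}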

>>> 6. Rather than fix this in Linux, KVM has added a work-around in which
>>> the *hypervisor* flushes the caches at certain points (!!!).  Julien is
>>> looking into doing the same with Xen.
>>
>> The "at certain points" doesn't quite describe it. We fully emulate S/W
>> instructions using the biggest hammer we can find.
> 
> Oh, I thought Julien was saying something about flushing the guest's RAM
> every time caching was enabled or disabled.

Yes, that's what it does ([2]), but usually that's at early boot and we
don't have many pages actually populated at this point. Hence Julien's
PoD proposal to allow using the same optimization.

Cheers,
Andre.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/virt/kvm/arm/mmu.c#n1960
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/virt/kvm/arm/mmu.c#n382

>>> Given the variety of hardware that Linux has to run on, it's hard to
>>> understand why 1) 32-bit ARM Linux couldn't detect if it would be
>>> appropriate to use VA-based instructions rather than S/W instructions 2)
>>> There couldn't at least be a Kconfig option to use VA instructions
>>> instead of S/W instructions.
>>
>> [Linux hat on]
>>
>> 1) There is hardly anything to detect. Both sets of CMOs are available
>> on a moderately recent implementation. What you'd want to detect is whether
>> the kernel is "virtualizable", which is not an easy task.
> <snip>
>> An alternative option would be to switch to VA CMOs if compiled for
>> ARMv7 (and maybe v6), assuming that doesn't have any horrible side
>> effect with broken cache implementations (and there are a few out there).
>> You'll have to check that this doesn't regress on any existing HW.
> 
> So the idea would be to use the VA-based operations if available, and
> then special-case specific chipsets known to have issues.  Linux (and
> Xen and...) end up doing this for lots of different kinds of hardware;
> this would be no different.
> 
>> 2) Kconfig options are the way to hell. It took us 5 years to get a
>> 32bit kernel that would boot on about anything, and we're not going to
>> go back.
> 
> Well, at the moment you *don't* have a 32-bit kernel that will boot on
> anything.  It won't boot (it sounds like) on any 32-bit system that has
> a system cache, including a 64-bit hypervisor providing a 32-bit guest.
> 
> Alternately, would it make sense to have a PV "cache flush" operation
> for hypervisors?  x86 has a way to expose hypervisor capabilities via
> specific CPUID leaves.  Does anything like this exist for ARM?  If so,
> the code could be, "If virtualized and hypervisor provides PV cache
> flush, use that.  Otherwise, fall back to S/W operation."
> 
>> Of course, none of that will solve the most important issue, which is to
>> boot an unmodified kernel from yesterday to install a distribution. If
>> you want to be able to do that, you'll have to use the aforementioned
>> hammer.
> 
> Well, it will take time to code up a solution and get *that* into users'
> hands as well.  I would think the fastest way to get *most* distros
> working would be to open a ticket saying it's broken on virtual
> hardware, and asking them to apply a patch.  Then prioritize getting
> more "enterprisey" distros working if and when needed.
> 
> Just to be clear -- I'm just trying to help push to explore other
> options here.  I'm not opposed to Julien or someone making a work-around
> in Xen.  But it's quite a bit of effort to achieve a pretty crappy end,
> so I think it's worth exploring what kind of effort we could spend
> achieving a "proper" fix first.
> 
> (Thanks also for taking the time to help explain this.)
> 
>  -George
> 


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-11 10:06         ` Jan Beulich
@ 2017-12-11 11:11           ` Andrew Cooper
  2017-12-11 11:58             ` Jan Beulich
  2017-12-11 20:26           ` Julien Grall
  1 sibling, 1 reply; 41+ messages in thread
From: Andrew Cooper @ 2017-12-11 11:11 UTC (permalink / raw)
  To: Jan Beulich, Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	George Dunlap, xen-devel

On 11/12/17 10:06, Jan Beulich wrote:
>>>> On 08.12.17 at 15:38, <julien.grall@linaro.org> wrote:
>> On 08/12/17 08:03, Tim Deegan wrote:
>>> It should be possible to do something like the misconfigured-entry bit
>>> trick by _allocating_ the memory up-front and building the p2m entries
>>> but only making them usable by the {IO}MMUs on first access.  That
>>> would make these early p2m walks shorter (because they can skip whole
>>> subtrees that aren't marked present yet) without making major changes
>>> to domain build or introducing run-time failures.
>> I am not aware of any way on Arm to misconfigure an entry. We do have 
>> valid and access bits, although they will affect the IOMMU as well. So 
>> it will not be possible to get page-table sharing with this "feature" 
>> enabled.
> How would you intend to solve the IOMMU part of the problem with
> PoD? As was pointed out before - IOMMU and PoD are incompatible
> on x86.

Not only that.

The use of an IOMMU is incompatible with any HAP scheme using EPT/NPT
violations to trigger hypervisor work, and this will remain the case
until such time as IOMMUs gain restartable page faults.  The chances of
that happening are essentially zero, due to timing requirements in the
PCI(e) spec.

~Andrew


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-11 11:11           ` Andrew Cooper
@ 2017-12-11 11:58             ` Jan Beulich
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2017-12-11 11:58 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Julien Grall,
	Tim Deegan, George Dunlap, xen-devel

>>> On 11.12.17 at 12:11, <andrew.cooper3@citrix.com> wrote:
> On 11/12/17 10:06, Jan Beulich wrote:
>>>>> On 08.12.17 at 15:38, <julien.grall@linaro.org> wrote:
>>> On 08/12/17 08:03, Tim Deegan wrote:
>>>> It should be possible to do something like the misconfigured-entry bit
>>>> trick by _allocating_ the memory up-front and building the p2m entries
>>>> but only making them usable by the {IO}MMUs on first access.  That
>>>> would make these early p2m walks shorter (because they can skip whole
>>>> subtrees that aren't marked present yet) without making major changes
>>>> to domain build or introducing run-time failures.
>>> I am not aware of any way on Arm to misconfigure an entry. We do have 
>>> valid and access bits, although they will affect the IOMMU as well. So 
>>> it will not be possible to get page-table sharing with this "feature" 
>>> enabled.
>> How would you intend to solve the IOMMU part of the problem with
>> PoD? As was pointed out before - IOMMU and PoD are incompatible
>> on x86.
> 
> Not only that.
> 
> The use of an IOMMU is incompatible with any HAP scheme using EPT/NPT
> violations to trigger hypervisor work,

For many forms of "hypervisor work" I agree, but our misconfig
scheme demonstrates that there are exceptions where the IOMMU
continues to work fine.

Jan



* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-11 11:10                         ` Andre Przywara
@ 2017-12-11 12:15                           ` George Dunlap
  2017-12-11 21:11                           ` Julien Grall
  1 sibling, 0 replies; 41+ messages in thread
From: George Dunlap @ 2017-12-11 12:15 UTC (permalink / raw)
  To: Andre Przywara, Marc Zyngier, Julien Grall, Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Stefano Stabellini, Tim Deegan, xen-devel

On 12/11/2017 11:10 AM, Andre Przywara wrote:
> Hi,
> 
> On 08/12/17 10:56, George Dunlap wrote:
>> On 12/07/2017 07:21 PM, Marc Zyngier wrote:
>>> On 07/12/17 18:06, George Dunlap wrote:
>>>> On 12/07/2017 04:58 PM, Marc Zyngier wrote:
>>>>> On 07/12/17 16:44, George Dunlap wrote:
>>>>>> On 12/07/2017 04:04 PM, Julien Grall wrote:
>>>>>>> Hi Jan,
>>>>>>>
>>>>>>> On 07/12/17 15:45, Jan Beulich wrote:
>>>>>>>>>>> On 07.12.17 at 15:53, <marc.zyngier@arm.com> wrote:
>>>>>>>>> On 07/12/17 13:52, Julien Grall wrote:
>>>>>>>>> There is exactly one case where set/way makes sense, and that's when
>>>>>>>>> you're the only CPU left in the system, your MMU is off, and you're
>>>>>>>>> about to go down.
>>>>>>>>
>>>>>>>> With this and ...
>>>>>>>>
>>>>>>>>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>>>>>>>>> migrating from one CPU to another. So you could happily be flushing by
>>>>>>>>> S/W, and still end up with dirty lines in your cache. Success!
>>>>>>>>
>>>>>>>> ... this I wonder what value emulating those insns then has in the first
>>>>>>>> place. Can't you as well simply skip and ignore them, with the same
>>>>>>>> (bad) result?
>>>>>>>
>>>>>>> The result will be much, much worse. Here is a concrete example with
>>>>>>> 32-bit Arm Linux:
>>>>>>>
>>>>>>>     1) Cache enabled
>>>>>>>     2) Decompress
>>>>>>>     3) Nuke cache (S/W)
>>>>>>>     4) Cache off
>>>>>>>     5) Access new kernel
>>>>>>>
>>>>>>> If you skip #3, the decompressed data may not have reached memory, so
>>>>>>> you would access stale data.
>>>>>>>
>>>>>>> This would effectively mean we don't support Linux Arm 32-bit.
>>>>>>
>>>>>> So Marc said that #3 "doesn't make sense", since although it might be
>>>>>> the only CPU still on in the system, you're not "about to go down"; but Linux
>>>>>> 32-bit is doing that anyway.
>>>>>
>>>>> "Doesn't make sense" on an ARMv7+ with SMP. That code dates back to
>>>>> ARMv4, and has been left untouched ever since. "If it ain't broke..."
>>>>>
>>>>>> It sounds like from the slides the purpose of #3 might be to get stuff
>>>>>> out of the D-cache into the I-cache.  But why is the cache turned off?
>>>>>
>>>>> Linux mandates that the kernel is entered with the MMU off. Which has
>>>>> the effect of disabling the caches too (VIVT caches and all that jazz).
>>>>>
>>>>>> And why doesn't Linux use the VA-based flushes rather than the S/W flushes?
>>>>>
>>>>> Linux/arm64 does. Changing the 32bit port to use VA CMOs would probably
>>>>> break stuff from the late 90s, so that's not going to happen. These
>>>>> days, I tend to pick my battles... ;-)
>>>>
>>>> OK, so let me try to state this "forwards" for those of us not familiar
>>>> with the situation:
>>>>
>>>> 1. Linux expects to start in 'linear' mode, with the MMU disabled.
>>>>
>>>> 2. On ARM, disabling the MMU disables caching (!).  But disabling
>>>> caching doesn't flush the cache; it just means the cache is bypassed (!).
>>>>
>>>> 3. Which means for Linux on ARM, after unzipping the kernel image, you
>>>> need to flush the cache before disabling the MMU and starting Linux proper.
>>>>
>>>> 4. For historical reasons, 32-bit ARM Linux uses the S/W instructions to
>>>> flush the cache.  This still works on 32-bit hardware, and so the Linux
>>>> maintainers are loath to change it, even though more reliable VA-based
>>>> instructions are available (?).
>>>
>>> It also works on 64bit HW. It is just not easily virtualizable, which is
>>> why we've removed all S/W from the 64bit Linux port a while ago.
>>
>> From the diagram in your talk, it looked like the "flush the cache"
>> operation *doesn't* work anywhere that has a "system cache", even on
>> bare metal.
> 
> What Marc probably meant is that they still work *within the
> architectural limits* that s/w operations provide:
> - S/W CMOs are not broadcasted, so in a live SMP system they are
> probably not doing what you expect them to do. This isn't an issue for a
> 32-bit Linux kernel decompressor, because it is still UP at this point.
> - S/W CMOs are optional to implement for system caches. As Marc
> mentioned, there are not many 32-bit systems with a system cache out
> there.

Right, that's what I said -- on any 32-bit system with a system cache
that doesn't implement the S/W functionality, using S/W to flush the
cache won't work, even on bare metal.

> And on those systems you can still boot an uncompressed kernel or
> use a gzip-ed kernel and let the bootloader (grub, U-Boot) decompress it.
> On the other hand there seem to be a substantial number of (older)
> 32-bit systems where VA CMOs have issues.

OK, good to know.

> The problem now is that for the "32-bit kernel on a 64-bit hypervisor"
> case those two assumptions are not true: the system has multiple CPUs
> running already, also 64-bit hardware is much more likely to have system
> caches.
> So this is mostly a virtualization problem and thus should be solved here.

Right.

> To help assess the benefits of adding PoD to Xen:

Can we come up with a different terminology for this functionality than
'PoD'?  On x86 populate-on-demand is quite different in functionality
and in target goal than what Julien is describing.

The goal of PoD on x86 is being able to boot a guest that actually uses
(say) 1GiB of RAM, but allow it to balloon up later to use 2GiB of
RAM, in circumstances where memory hotplug is not
available.  This means telling a guest it has 2GiB of RAM, but only
allocating 1GiB of host RAM for it, and shuffling memory around
behind-the-scenes until the balloon driver can come up and "free" 1GiB
of empty space back to Xen.

On x86 in PoD, the p2m table is initialized with entries which are
'empty' from the hardware point of view (no mfn).  Memory is allocated
to a per-domain "PoD pool" on domain creation, then assigned to the p2m
as it's used.  If the memory remains zero, then it may be reclaimed
under certain circumstances and moved somewhere else.  Once the memory
becomes non-zero, it must never be moved.  If a guest ever "dirties" all
of its initial allocation (i.e., makes it non-zero), then Xen will crash
it rather than allocate more memory.
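
In rough pseudocode, that fault path is (deliberately simplified -- the
real x86 code also handles superpages, sweeping heuristics, locking and
so on, and the helper names here are illustrative):

static bool pod_populate(struct domain *d, unsigned long gfn)
{
    struct page_info *pg = pod_pool_get(d);  /* filled at domain creation */

    if ( !pg )
        pg = pod_reclaim_zero_page(d);       /* sweep for a still-zero page */

    if ( !pg )
    {
        domain_crash(d);   /* guest dirtied more than its allocation */
        return false;
    }

    p2m_set_entry(d, gfn, page_to_mfn(pg));  /* back the gfn with real RAM */
    return true;
}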

What Julien is describing is different.  For one thing, for many dom0's,
it's not appropriate to put memory in arbitrary places; you need a 1-1
mapping, so the "populate with random memory from a pool" isn't
appropriate.  For another, Julien will (I think?) want a way to detect
reads and writes to memory pages which have non-zero data.  This is not
something that the current PoD code has anything to do with.

It also seems like in the future, ARM may want something like the x86
PoD (i.e., the ability to boot a guest with 1GiB of RAM and then balloon
it up to 2GiB).  So keeping the 'PoD' name reserved for that
functionality makes more sense.

In fact, this sounds an awful lot like 'logdirty', except that you want
to log read accesses in addition to write accesses (to determine what
might be in the cache).  Maybe 'logaccess' mode?
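
Speculating a little about what 'logaccess' could look like if it
mirrored the logdirty machinery (none of this exists today; all names
are hypothetical):

/* Hypothetical p2m types -- p2m_ram_logaccess does not exist. */
typedef enum {
    p2m_ram_rw,          /* normal RAM */
    p2m_ram_logdirty,    /* write faults logged (x86 today) */
    p2m_ram_logaccess,   /* read *and* write faults logged */
} p2m_type_t;

/* On an access fault against a logaccess entry: record the gfn (it may
 * now have cache footprint), flip it back to plain RAM, and retry. */
static void logaccess_fault(struct domain *d, unsigned long gfn)
{
    mark_gfn_accessed(d, gfn);
    p2m_change_type(d, gfn, p2m_ram_logaccess, p2m_ram_rw);
}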

> But on the other hand we had PoD naturally already in KVM, so this came
> at no cost.

As I've said in another thread, it's not accurate to say that KVM uses
PoD.  In PoD, the memory is pre-allocated to the domain before the guest
starts; I assume on KVM the memory isn't allocated until it's used (like
a normal process).  In PoD, if the total amount of non-zero memory in
the guest exceeds this amount, then Xen will crash the guest.  In KVM, I
assume that there is no implicit limit: if it doesn't have free host ram
when the allocation happens, then it evicts something from a buffer or
swaps some process / VM memory out to disk.

Hope I'm not being too pedantic here, but "the devil is in the details",
so I think it's important when comparing KVM and Xen's solutions to be
aware of the differences. :-)

In any case, if Julien wants to emulate the S/W instructions, it seems
like having 'logaccess' functionality in Xen is probably the only
reasonable way to accomplish that (as 'full VA flush' will quickly
become unworkable as the guest size grows).

> So I believe it would be worth investigating what the actual impact is
> of booting a 32-bit kernel while emulating s/w ops like KVM does (see
> below), but cleaning the *whole VA space*. If this is somewhat
> acceptable (I assume we have no more than 2GB for a typical ARM32
> guest), it might be worth ignoring PoD, at least for now, to solve
> this problem (and the IOMMU consequences).
> 
> This assumes that a single "full VA flush" cannot be abused as a DoS by
> a malicious guest, which should be investigated independently (as this
> applies to a PoD implementation as well).

Well the flush itself would need to be preemptible.  And it sounds like
you'd need to handle migration specially somehow too.  For one, you'd
need to make sure at least that all the cache on the current pcpu was
"cleaned" before running a vcpu anywhere else; and you'd also need to
make sure that any pcpu on which the vcpu had ever run had its entries
"invalidated" before the vcpu was run there again.

> Somewhat optional read for the background of how KVM optimized this ([1]):
> 
> KVM's solution to this problem works under the assumption that s/w
> operations with the caches (and MMU) on are not really meaningful, so we
> don't bother emulating them to the letter. 

Right -- so even on KVM, you're not actually following the ARM spec wrt
the S/W instructions: you're only handling the case that's fairly common
(i.e., flushing the cache with the MMU off).

Thanks,
 -George


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-10 15:22         ` Tim Deegan
@ 2017-12-11 19:50           ` Julien Grall
  0 siblings, 0 replies; 41+ messages in thread
From: Julien Grall @ 2017-12-11 19:50 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Stefano Stabellini, George Dunlap, Andrew Cooper, George Dunlap,
	Marc Zyngier, Jan Beulich, Andre Przywara, xen-devel

Hi,

On 12/10/2017 03:22 PM, Tim Deegan wrote:
> At 14:38 +0000 on 08 Dec (1512743913), Julien Grall wrote:
>> On 08/12/17 08:03, Tim Deegan wrote:
>>> +1 for avoiding the full majesty of PoD if you don't need it.
>>>
>>> It should be possible to do something like the misconfigured-entry bit
>>> trick by _allocating_ the memory up-front and building the p2m entries
>>> but only making them usable by the {IO}MMUs on first access.  That
>>> would make these early p2m walks shorter (because they can skip whole
>>> subtrees that aren't marked present yet) without making major changes
>>> to domain build or introducing run-time failures.
>>
>> I am not aware of any way on Arm to misconfigure an entry. We do have
>> valid and access bits, although they will affect the IOMMU as well. So
>> it will not be possible to get page-table sharing with this "feature"
>> enabled.
> 
> How unfortunate.  How does KVM's demand-population scheme handle the IOMMU?

From what I have heard, when using an IOMMU all the memory is pinned.
They also don't share page tables.

But I am not a KVM expert, maybe Andre/Marc can confirm here?

Cheers,

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-11 10:06         ` Jan Beulich
  2017-12-11 11:11           ` Andrew Cooper
@ 2017-12-11 20:26           ` Julien Grall
  2017-12-12  7:52             ` Jan Beulich
  1 sibling, 1 reply; 41+ messages in thread
From: Julien Grall @ 2017-12-11 20:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	George Dunlap, Andrew Cooper, xen-devel

Hi Jan,

On 12/11/2017 10:06 AM, Jan Beulich wrote:
>>>> On 08.12.17 at 15:38, <julien.grall@linaro.org> wrote:
>> On 08/12/17 08:03, Tim Deegan wrote:
>>> It should be possible to do something like the misconfigured-entry bit
>>> trick by _allocating_ the memory up-front and building the p2m entries
>>> but only making them usable by the {IO}MMUs on first access.  That
>>> would make these early p2m walks shorter (because they can skip whole
>>> subtrees that aren't marked present yet) without making major changes
>>> to domain build or introducing run-time failures.
>>
>> I am not aware of any way on Arm to misconfigure an entry. We do have
>> valid and access bits, although they will affect the IOMMU as well. So
>> it will not be possible to get page-table sharing with this "feature"
>> enabled.
> 
> How would you intend to solve the IOMMU part of the problem with
> PoD? As was pointed out before - IOMMU and PoD are incompatible
> on x86.

I am not sure why you ask about PoD here when I acknowledged I will look
at a different solution. And again, misconfiguring an entry is not
possible on Arm.

But to answer your question, the IOMMU will be supported with neither PoD
nor the access/valid-bit solution. And that's fine: because S/W ops are
not easily virtualizable, I take that as a hint that "all the features
may not be available when using S/W in a guest".

Cheers,

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-11 11:10                         ` Andre Przywara
  2017-12-11 12:15                           ` George Dunlap
@ 2017-12-11 21:11                           ` Julien Grall
  1 sibling, 0 replies; 41+ messages in thread
From: Julien Grall @ 2017-12-11 21:11 UTC (permalink / raw)
  To: Andre Przywara, George Dunlap, Marc Zyngier, Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Stefano Stabellini, Tim Deegan, xen-devel

On 12/11/2017 11:10 AM, Andre Przywara wrote:
> Hi,

Hi Andre,

> But on the other hand we had PoD naturally already in KVM, so this came
> at no cost.
> So I believe it would be worth investigating what the actual impact is
> of booting a 32-bit kernel while emulating s/w ops like KVM does (see
> below), but cleaning the *whole VA space*. If this is somewhat
> acceptable (I assume we have no more than 2GB for a typical ARM32
> guest), it might be worth ignoring PoD, at least for now, to solve
> this problem (and the IOMMU consequences).

I am fairly surprised you think I came up with this solution without any
investigation. I actually clearly stated in my first e-mail that
Linux is not able to bring up a CPU with a flush of the "whole VA space".

At the moment, 32-bit Linux has a 1-second timeout to bring up a
secondary CPU. Within that second we need to do at least one full flush
(I think there is a second one). In the case of Xen Arm32, the domain
heap (where domain memory belongs) is not mapped in the hypervisor. So
you end up creating a mapping for every page table and the final memory.
To that, you add the cost of doing the cache maintenance. Then, you
finally add the potential cost of preemption (the vCPU might be
scheduled out).

During my initial investigation, I was not able to boot Dom0 with 512MB.
I tried to optimize the mapping path, but it didn't show much
improvement in general.

Regarding the IOMMU consequences, S/W ops are not easily virtualizable.
If you use them, that is the price to pay. It is better than not being
able to boot a current kernel, or randomly crashing.

Cheers,

-- 
Julien Grall


* Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
  2017-12-11 20:26           ` Julien Grall
@ 2017-12-12  7:52             ` Jan Beulich
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2017-12-12  7:52 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, George Dunlap, Andre Przywara, Tim Deegan,
	George Dunlap, Andrew Cooper, xen-devel

>>> On 11.12.17 at 21:26, <julien.grall@linaro.org> wrote:
> On 12/11/2017 10:06 AM, Jan Beulich wrote:
>>>>> On 08.12.17 at 15:38, <julien.grall@linaro.org> wrote:
>>> On 08/12/17 08:03, Tim Deegan wrote:
>>>> It should be possible to do something like the misconfigured-entry bit
>>>> trick by _allocating_ the memory up-front and building the p2m entries
>>>> but only making them usable by the {IO}MMUs on first access.  That
>>>> would make these early p2m walks shorter (because they can skip whole
>>>> subtrees that aren't marked present yet) without making major changes
>>>> to domain build or introducing run-time failures.
>>>
>>> I am not aware of any way on Arm to misconfigure an entry. We do have
>>> valid and access bits, although they will affect the IOMMU as well. So
>>> it will not be possible to get page-table sharing with this "feature"
>>> enabled.
>> 
>> How would you intend to solve the IOMMU part of the problem with
>> PoD? As was pointed out before - IOMMU and PoD are incompatible
>> on x86.
> 
> I am not sure why you ask about PoD here when I acknowledge I will look 
> at a different solution. And again, misconfiguring an entry is not 
> possible on Arm.

I'm sorry if I've overlooked any such acknowledgment; it's certainly
not in context above.

Jan


