* non-x86 per-task stack canaries
From: Kees Cook @ 2017-06-26 21:04 UTC
To: kernel-hardening; +Cc: LKML, linux-arm-kernel

Hi,

The stack protector functionality on x86_64 uses %gs:0x28 (%gs is the
percpu area) for __stack_chk_guard, and all other architectures use a
global variable instead. This means we never change the stack canary
on non-x86 architectures which allows for a leak in one task to expose
the canary in another task.

I'm curious what thoughts people may have about how to get this
correctly implemented. Teaching the compiler about per-cpu data sounds
exciting. :)

-Kees

--
Kees Cook
Pixel Security
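To make the exposure concrete, here is a minimal userspace sketch in C of the two schemes described above. It is not kernel code: `struct demo_task`, `demo_global_guard` and the helper names are hypothetical stand-ins, and the constants are arbitrary. It only models which value a function prologue would compare against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct demo_task {
    uint64_t canary;    /* per-task value, chosen when the task is created */
};

/* One guard shared by everything, as with a single global symbol. */
uint64_t demo_global_guard = 0x5bd1e995a1b2c3d4ULL;

/* Give a task its own canary (the caller supplies the random value). */
void demo_task_init(struct demo_task *t, uint64_t canary)
{
    assert(t != NULL);
    t->canary = canary;
}

/* Global scheme: every task sees the same value. */
uint64_t canary_global(const struct demo_task *t)
{
    (void)t;
    return demo_global_guard;
}

/* Per-task scheme: the value depends on which task is running. */
uint64_t canary_per_task(const struct demo_task *t)
{
    return t->canary;
}

/* True if a canary leaked from `leaker` would also pass the check in `victim`. */
bool leak_defeats(uint64_t (*get)(const struct demo_task *),
                  const struct demo_task *leaker,
                  const struct demo_task *victim)
{
    return get(leaker) == get(victim);
}
```

With two tasks initialised to different values, `leak_defeats(canary_global, &a, &b)` is true while `leak_defeats(canary_per_task, &a, &b)` is false, which is the property being asked for here.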
* Re: [kernel-hardening] non-x86 per-task stack canaries
From: Daniel Micay @ 2017-06-26 22:52 UTC
To: Kees Cook, kernel-hardening; +Cc: LKML, linux-arm-kernel, Mark Rutland

On Mon, 2017-06-26 at 14:04 -0700, Kees Cook wrote:
> Hi,
>
> The stack protector functionality on x86_64 uses %gs:0x28 (%gs is the
> percpu area) for __stack_chk_guard, and all other architectures use a
> global variable instead. This means we never change the stack canary
> on non-x86 architectures which allows for a leak in one task to expose
> the canary in another task.
>
> I'm curious what thoughts people may have about how to get this
> correctly implemented. Teaching the compiler about per-cpu data sounds
> exciting. :)
>
> -Kees

arm64 has many integer registers so I don't think reserving one would
hurt performance, especially in the kernel where hot numeric loops
barely exist. It would reduce the cost of SSP by getting rid of the
memory read for the canary value. On the other hand, using per-cpu data
would likely be higher cost than the global. x86 has segment registers
but most archs probably need to do something more painful.

It's safe as long as it's a callee-saved register. It should be enforced
that there's no assembly spilling it and calling into C code without the
random canary. There's very little assembly using registers like x28 so
it wouldn't be that bad. It's possible there's one where nothing needs
to be changed, there only needs to be a check to make sure it stays that
way.

It would be a step towards making SSP cheap enough to expand it into a
feature like the StackGuard XOR canaries.

Samsung has a return address XOR feature based on reserving a register
and while RAP's probabilistic return address mitigation isn't
open-source, it was stated that it reserves a register on x86_64 where
they aren't as plentiful as arm64.
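As a rough illustration of the XOR-canary idea mentioned above, the following C sketch models a stack frame in userspace. It is an assumption-laden toy, not StackGuard's or anyone's actual implementation: `struct demo_frame`, `frame_enter` and `frame_check` are hypothetical names, and the `secret` parameter stands in for the value that would live in the reserved register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the two stack slots of interest in one frame. */
struct demo_frame {
    uint64_t guard_slot;    /* holds secret XOR return address */
    uint64_t return_addr;   /* saved return address */
};

/* Prologue: bind the guard slot to this frame's return address. */
void frame_enter(struct demo_frame *f, uint64_t secret, uint64_t ret)
{
    assert(f != NULL);
    f->return_addr = ret;
    f->guard_slot = secret ^ ret;
}

/* Epilogue: the frame is intact only if the two slots still agree. */
bool frame_check(const struct demo_frame *f, uint64_t secret)
{
    assert(f != NULL);
    return (f->guard_slot ^ f->return_addr) == secret;
}
```

Because the guard slot depends on the return address, changing `return_addr` alone makes `frame_check` fail even when the guard slot itself is left untouched, which a plain canary comparison would not detect.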
* Re: [kernel-hardening] non-x86 per-task stack canaries
From: Mark Rutland @ 2017-06-27 10:06 UTC
To: Daniel Micay; +Cc: Kees Cook, kernel-hardening, LKML, linux-arm-kernel

On Mon, Jun 26, 2017 at 06:52:31PM -0400, Daniel Micay wrote:
> On Mon, 2017-06-26 at 14:04 -0700, Kees Cook wrote:
> > Hi,
> >
> > The stack protector functionality on x86_64 uses %gs:0x28 (%gs is the
> > percpu area) for __stack_chk_guard, and all other architectures use a
> > global variable instead. This means we never change the stack canary
> > on non-x86 architectures which allows for a leak in one task to expose
> > the canary in another task.

FWIW, I'd love to have per-task canaries on arm64.

> > I'm curious what thoughts people may have about how to get this
> > correctly implemented. Teaching the compiler about per-cpu data sounds
> > exciting. :)

One concern I'd have is that it's possible/likely that we'll want to
change the way we handle per-cpu offsets in future. One specific reason
is that we may need to shuffle the way we use TPIDR_EL1 and SP_EL0 to
allow us to implement stack overflow handling on arm64 using EL1t mode.

It would be beneficial if we could somehow avoid baking this detail into
the compiler. For example, by having an inlinable callback to load the
canary, or adding the protection using a plugin that we control.

> arm64 has many integer registers so I don't think reserving one would
> hurt performance, especially in the kernel where hot numeric loops
> barely exist.

A while back I did experiments with an ancient GCC, reserving single
GPRs with -ffixed. For a kernel compile workload, with said ancient GCC,
reserving the register had a small, but noisy impact. With more recent
GCCs it was much more noisy, and it looked like it was liable to
adversely affect performance.

We'd need numbers across a few GCC versions (and clang too, I guess).

> It would reduce the cost of SSP by getting rid of the memory read for
> the canary value. On the other hand, using per-cpu data would likely
> be higher cost than the global. x86 has segment registers but most
> archs probably need to do something more painful.

I had a prototype [1] that used the reserved GPR to hold the per-cpu
offset. That allows access to per-cpu data using plain loads/stores with
a register-offset addressing mode. If your arch has an addressing mode
that takes a base register and an offset register, you can use a GPR in
place of x86's segment register.

That should benefit most this_cpu_*() ops, as it's no longer necessary
to disable preemption for address generation, and is likely preferable
to using it for the canary alone.

Atomics are more complex, as those can be LL/SC and/or have limited
addressing modes, but those are both solvable.

> It's safe as long as it's a callee-saved register. It should be enforced
> that there's no assembly spilling it and calling into C code without the
> random canary. There's very little assembly using registers like x28 so
> it wouldn't be that bad. It's possible there's one where nothing needs
> to be changed, there only needs to be a check to make sure it stays that
> way.

IIRC, the exception entry paths need to be altered to set up the GPR,
but that was about it.

EFI runtime services are outside of our control and might spill any
callee-saved registers, so we'd need to restore the GPR upon exceptions
from EL1. Luckily (AFAIK) those don't call back into the kernel
otherwise.

The AAPCS reserves x18 as a platform register for special usage, and
this might be the best choice. For example the EFI spec says that
runtime services mustn't touch this (though I can believe there's buggy
code which does).

> It would be a step towards making SSP cheap enough to expand it into a
> feature like the StackGuard XOR canaries.
>
> Samsung has a return address XOR feature based on reserving a register
> and while RAP's probabilistic return address mitigation isn't open-
> source, it was stated that it reserves a register on x86_64 where they
> aren't as plentiful as arm64.

Thanks,
Mark.

[1] git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git arm64/this-cpu-reg
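The base-plus-offset addressing described above can be sketched in plain C as follows. This is a userspace model under stated assumptions, not the prototype in [1]: `demo_areas`, `demo_offset_for` and `demo_this_cpu_counter` are hypothetical names, the per-cpu areas are modelled as a simple array, and the `percpu_off` argument stands in for the value a reserved GPR would hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_CPUS 4

/* Toy per-cpu area; the real layout is decided by the kernel, not by this. */
struct demo_percpu {
    uint64_t canary;
    uint64_t counter;
};

/* One area per CPU, laid out back to back. */
struct demo_percpu demo_areas[DEMO_NR_CPUS];

/* Byte offset from CPU 0's area to `cpu`'s area: what the GPR would hold. */
ptrdiff_t demo_offset_for(unsigned int cpu)
{
    assert(cpu < DEMO_NR_CPUS);
    return (ptrdiff_t)(cpu * sizeof(struct demo_percpu));
}

/*
 * Address generation is a single base + offset computation, so there is no
 * separate "look up which CPU we are on" step that could go stale.
 */
uint64_t *demo_this_cpu_counter(ptrdiff_t percpu_off)
{
    char *base = (char *)&demo_areas[0].counter;   /* link-time base */
    return (uint64_t *)(base + percpu_off);        /* base + register */
}
```

For example, storing through `demo_this_cpu_counter(demo_offset_for(2))` updates `demo_areas[2].counter` and leaves the other CPUs' areas alone. In a real kernel the offset would be set up on exception entry rather than passed as an argument.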