* RFC v2: post-init-read-only protection for data allocated dynamically
@ 2017-05-03 12:06 Igor Stoppa
       [not found] ` <70a9d4db-f374-de45-413b-65b74c59edcb@intel.com>
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Igor Stoppa @ 2017-05-03 12:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel

Hello,

please review my (longish) line of thoughts, below.

I've restructured them so that they should be easier to follow.


Observations
------------

* it is currently possible, by using the prefix "__read_only", to have
the linker place a static variable into a special memory region, which
will become write-protected at the end of the init phase.

* the purpose is to write-protect data which is not expected to change,
ever, after it has been initialized.

* The mechanism used for locking down the memory region is to program
the MMU to trap writes to said region. It is fairly efficient and
HW-backed, so it doesn't introduce any major overhead, but the MMU deals
only with pages or supersets of pages, hence the need to collect all the
soon-to-be-read-only data - and only that - into the "special region".
The "__read_only" modifier is the admission ticket.

* the write-protecting feature helps support memory integrity in
general and can also help spot rogue writes, whatever their origin
might be: uninitialized or dangling pointers, wrong pointer arithmetic,
etc. (see the sketch below)
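
To make the static case concrete, here is a minimal sketch. All names
are made up; "__read_only" is the marker named in this RFC (mainline
carries the analogous __ro_after_init attribute):

static int example_limit __read_only;	/* linker puts this in the special region */

static int __init example_init(void)
{
        example_limit = 16;	/* writes are fine while init is still running */
        return 0;
}
late_initcall(example_init);
/* once init ends, the region is write-protected by the MMU and any
 * later write to example_limit traps */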



Problem
-------

The feature is available only for *static* data - it will not work with
something like a linked list that is put together during init, for example.
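
Concretely, a list assembled during init out of kmalloc'ed nodes has
nothing for the linker to place in the protected region. A minimal
illustration (all names made up):

struct hook {
        struct list_head list;
        void (*fn)(void);
};
static LIST_HEAD(hooks);	/* the head could be marked read-only... */

static int __init register_hook(void (*fn)(void))
{
        struct hook *h = kmalloc(sizeof(*h), GFP_KERNEL);

        if (!h)
                return -ENOMEM;
        h->fn = fn;
        list_add(&h->list, &hooks);	/* ...but the nodes land in ordinary r/w heap pages */
        return 0;
}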



Wish
----

My starting points are the policy DB of SE Linux and the LSM hooks, but
eventually I would like to extend the protection to other subsystems as
well, in a way that can be merged into mainline.



Analysis
--------

* the solution I come up with has to be as minimally invasive as
possible, at least for what concerns the various subsystems whose
integrity I want to enhance.

* In most, if not all, of the cases that could be enhanced, the code
will be calling kmalloc/vmalloc with GFP_KERNEL as the desired type of
memory.

* I suspect/hope that the various maintainers won't object too much if my
changes are limited to replacing GFP_KERNEL with some other macro, for
example what I previously called GFP_LOCKABLE, provided I can ensure that
(see the fallback sketch after this list):

  -1) no penalty is introduced, at least when the extra protection
      feature is not enabled, IOW nobody has to suffer from my changes.
      This means that GFP_LOCKABLE should fall back to GFP_KERNEL when
      it's not enabled.

  -2) when the extra protection feature is enabled, the code still
      works as expected, as long as the data identified for this
      enhancement is really unmodified after init.
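
A minimal sketch of the fallback from requirement -1, assuming a
hypothetical config option guarding the feature (CONFIG_POST_INIT_RO
and __GFP_LOCKABLE are placeholders, not existing kernel symbols):

#ifdef CONFIG_POST_INIT_RO
#define GFP_LOCKABLE    (GFP_KERNEL | __GFP_LOCKABLE)   /* routed to the protected pool */
#else
#define GFP_LOCKABLE    GFP_KERNEL      /* feature off: plain GFP_KERNEL, zero cost */
#endif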

* In my quest for improved memory integrity, I will deal with very
different memory sizes being allocated, so if I start writing my own
memory allocator, starting from a page-aligned chunk of normal memory,
at best I will end up with a replica of kmalloc, at worst with something
buggy. Either way, it will be much harder to push other subsystems
to use it.
I probably wouldn't like it either, if I were a maintainer.

* While I do not strictly need a new memory zone, memory zones are what
kmalloc understands at the moment: AFAIK, it is not possible to tell
kmalloc from which memory pool it should fish out the memory, other than
having a reference to a memory zone.
If it were possible to aim kmalloc at arbitrary memory pools, probably we
would not be having this exchange right now. And probably there would
not be so many other folks trying to get their memory zone of interest
merged. However I suspect this solution would be sub-optimal for
the normal use cases, because there would be the extra overhead of
passing a reference to the memory pool, instead of encoding it into
bitfields, together with other information.

* there are very slim chances (to be optimistic :) that I can get away
with having my custom zone merged, because others are trying with
similar proposals and getting rejected; maybe I can have better luck
if I propose something that can also work for others.

* currently memory zones are mapped 1:1 to bits in a crowded bitmask,
but not all these zones are really needed in a typical real system; some
are kept for backward compatibility and for supporting distros, which
cannot know upfront the quirks of the HW they will be running on.


Conclusions
-----------

* the solution that seems most likely to succeed is to remove the
1:1 mapping between optional zones and their respective bits.

* the bits previously assigned to the optional zones would become
available for mapping whatever zone a system integrator wants to
support (a sketch follows).
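
For reference, gfp.h already degrades the optional zones to ZONE_NORMAL
when they are not configured; a rebindable zone could reuse the same
pattern (ZONE_LOCKABLE and its config symbol are hypothetical):

#ifdef CONFIG_ZONE_LOCKABLE
#define OPT_ZONE_LOCKABLE       ZONE_LOCKABLE   /* bound to one of the freed-up bits */
#else
#define OPT_ZONE_LOCKABLE       ZONE_NORMAL     /* fall back to normal memory */
#endif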


Cons:
There would still be a hard constraint on the maximum number of zones
available simultaneously, so one would have to choose which of the
optional zones to enable, and be ready to deal with one's own zone
disappearing (ex: fall back to normal memory and give up some/all
functionality)

Pros:
* No bit would go to waste: those who want their own custom zone could
make better use of the allocated-but-unnecessary-to-them bits.
* There would be a standard way for people to add non-standard zones.
* It doesn't alter the hot paths that are critical to efficient memory
handling.

So it seems a win-win scenario, apart from the fact that I will probably
have to reshuffle a certain amount of macros :-)


P.S.
There was early advice to create and use a custom-made memory
allocator; I hope it's now clear why I don't think it's viable: it might
work if I use it only for further code that I will write, but it really
doesn't seem the best way to convince other subsystem maintainers to
take in my changes, if I suggest they give up the super-optimized
kmalloc (and friends) in favor of some homebrew allocator I wrote :-/


---
thanks, igor


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
       [not found] ` <70a9d4db-f374-de45-413b-65b74c59edcb@intel.com>
@ 2017-05-04  8:17   ` Igor Stoppa
  2017-05-04 14:30     ` Dave Hansen
  0 siblings, 1 reply; 23+ messages in thread
From: Igor Stoppa @ 2017-05-04  8:17 UTC (permalink / raw)
  To: Dave Hansen, Michal Hocko; +Cc: linux-mm, linux-kernel

Hi,
I suspect this was accidentally sent as a reply to sender only instead
of a reply-all, so I'm putting back the CCs that were dropped.

On 03/05/17 21:41, Dave Hansen wrote:
> On 05/03/2017 05:06 AM, Igor Stoppa wrote:
>> My starting points are the policy DB of SE Linux and the LSM hooks, but
>> eventually I would like to extend the protection to other subsystems as
>> well, in a way that can be merged into mainline.
> 
> Have you given any thought to just having a set of specialized slabs?

No, the idea of the RFC was to get this sort of comment about options I
might have missed :-)

> Today, for instance, we have a separate set of kmalloc() slabs for DMA:
> dma-kmalloc-{4096,2048,...}.  It should be quite possible to have
> another set for your post-init-read-only protected data.

I will definitely investigate it and report back, thanks.
But in the meantime I'd appreciate further clarifications.
Please see below ...

> This doesn't take care of vmalloc(), but I have the feeling that
> implementing this for vmalloc() isn't going to be horribly difficult.

ok

>> * The mechanism used for locking down the memory region is to program
>> the MMU to trap writes to said region. It is fairly efficient and
>> HW-backed, so it doesn't introduce any major overhead,
> 
> I'd take a bit of an issue with this statement.  It *will* fracture
> large pages unless you manage to pack all of these allocations entirely
> within a large page.  This is problematic because we use the largest
> size available, and that's 1GB on x86.

I am not sure I fully understand this part.
I am probably missing some point about the way kmalloc works.

I get the problem you describe, but I do not understand why it should
happen.
Going back for a moment to my original idea of the zone, as a physical
address range, why wouldn't it be possible to define it as one large page?

Btw, I do not expect much memory occupation, in terms of sheer
size, although there might be many small "variables" scattered across
the code. That's where I hope using kmalloc, instead of a custom-made
allocator, can make a difference, in terms of optimal occupation.

> IOW, if you scatter these things throughout the address space, you may
> end up fracturing/demoting enough large pages to cause major overhead
> refilling the TLB.

But why would I?
Or, better, what would cause it, unless I take special care?

Or, let me put it differently: my goal is to not fracture more pages
than needed.
It will probably require some profiling to figure out what is the
ballpark of the memory footprint.

I might have overlooked some aspect of this, but the overall goal
is to have a memory range (I won't call it zone, to avoid referring to a
specific implementation) which is as tightly packed as possible, stuffed
with all the data that is expected to become read-only.

> Note that this only applies for kmalloc() allocations, *not* vmalloc(),
> since kmalloc() uses the kernel linear map and vmalloc() uses its own,
> separate mappings.

Yes.

---
thanks, igor


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-03 12:06 RFC v2: post-init-read-only protection for data allocated dynamically Igor Stoppa
       [not found] ` <70a9d4db-f374-de45-413b-65b74c59edcb@intel.com>
@ 2017-05-04 11:21 ` Michal Hocko
  2017-05-04 12:14   ` Igor Stoppa
  2017-05-04 16:49 ` Laura Abbott
  2 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-05-04 11:21 UTC (permalink / raw)
  To: Igor Stoppa; +Cc: linux-mm, linux-kernel

On Wed 03-05-17 15:06:36, Igor Stoppa wrote:
[...]

> * the write-protecting feature helps support memory integrity in
> general and can also help spot rogue writes, whatever their origin
> might be: uninitialized or dangling pointers, wrong pointer arithmetic, etc.

I agree that such a feature can be really useful.

[...]

> * In most, if not all, of the cases that could be enhanced, the code
> will be calling kmalloc/vmalloc with GFP_KERNEL as the desired type of
> memory.

How do you tell that the seal is active? I have also asked about the
lifetime of these objects in the previous email thread. Do you expect
those objects to get freed one by one or mostly all at once? Is this
supposed to be boot time only or might such allocations happen anytime?

[...]

> * In my quest for improved memory integrity, I will deal with very
> different memory sizes being allocated, so if I start writing my own
> memory allocator, starting from a page-aligned chunk of normal memory,
> at best I will end up with a replica of kmalloc, at worst with something
> buggy. Either way, it will be much harder to push other subsystems
> to use it.
> I probably wouldn't like it either, if I were a maintainer.

The most immediate suggestion would be to extend SLAB caches with a new
sealing feature. Roughly it would mean that once kmem_cache_seal() is
called on a cache, it would change the page tables of the used slab pages
to RO state. This would obviously need some fiddling to make those sealed
pages unusable for new allocations. It would also mean some changes to
the kfree path, but I guess this is doable.

> * While I do not strictly need a new memory zone, memory zones are what
> kmalloc understands at the moment: AFAIK, it is not possible to tell
> kmalloc from which memory pool it should fish out the memory, other than
> having a reference to a memory zone.

As I've said already, I think that a zone is a completely wrong
approach. How would it help anyway? It is the allocator on top of the
page allocator which has to do clever things to support sealing.

-- 
Michal Hocko
SUSE Labs


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-04 11:21 ` Michal Hocko
@ 2017-05-04 12:14   ` Igor Stoppa
  2017-05-04 13:11     ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Igor Stoppa @ 2017-05-04 12:14 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, Dave Hansen

On 04/05/17 14:21, Michal Hocko wrote:
> On Wed 03-05-17 15:06:36, Igor Stoppa wrote:

[...]

>> * In most, if not all, of the cases that could be enhanced, the code
>> will be calling kmalloc/vmalloc with GFP_KERNEL as the desired type of
>> memory.
> 
> How do you tell that the seal is active?

The simplest way would be to define the seal as something that is applied
only after late init has concluded.

IOW, if the kernel has already started user-space, the seal is in place.

I do acknowledge that this conflicts with the current implementation of
SE Linux, but it might be possible to extend SE Linux to have a
predefined configuration file that is loaded during kernel init.

In general this is not acceptable, but OTOH IMA does it, so there could
also be grounds for advocating similar (optional) behavior for SE Linux.

Should that not be possible, then yes, I should provide some way (ioctl,
sysfs, something else) that can be used to apply the seal.

In such a case there should also be some helper function which allows
callers to confirm that the seal is absent/present, as sketched below.
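
Something as small as this should do (all names hypothetical):

static bool ro_seal_active;	/* flipped exactly once, when the seal is applied */

void ro_seal_apply(void)
{
        /* write-protect the pool here, then: */
        ro_seal_active = true;
}

bool ro_seal_is_active(void)
{
        return ro_seal_active;
}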

> I have also asked about the
> lifetime of these objects in the previous email thread. Do you expect
> those objects to get freed one by one or mostly all at once? Is this
> supposed to be boot time only or might such allocations happen anytime?

Yes, you did. I didn't mean to ignore the question.
I thought this question would be answered by the current RFC :-(

Alright, here's one more attempt at explaining what I have in mind.
And I might be wrong, so that would explain why it's not clear.

Once the seal is in place, the objects are effectively read-only, so the
lifetime is basically the same as the kernel text.
Since I am after providing the same functionality as
post-init-read-only, but for dynamically allocated data, I would stick
to the same behavior: once the data is read-only, it stays so forever,
or till reset/poweroff, whichever comes first.

I wonder if you are thinking about loadable modules or maybe livepatch.
My proposal, in its current form, is only about what is done when the
kernel initialization is performed. So it would not take those cases
under its umbrella. Actually it might be incompatible with livepatch, if
any of the read-only data is supposed to be updated.

Since it's meant to improve the current level of integrity, I would
prefer to have a progressive approach and address modules/livepatch in a
later phase, if this is not seen as a show stopper.

[...]

> The most immediate suggestion would be to extend SLAB caches with a new
> sealing feature.

Yes, a few hours ago I got the same advice from Dave Hansen as well
(thanks, btw) [1].

I had just not considered the option.

> Roughly it would mean that once kmem_cache_seal() is
> called on a cache, it would change the page tables of the used slab pages
> to RO state. This would obviously need some fiddling to make those sealed
> pages unusable for new allocations. It would also mean some changes to
> the kfree path, but I guess this is doable.

Ok, as it probably has already become evident, I have just started
peeking into the memory subsystem, so this is the sort of guidance I was
hoping I could receive =) - thank you

Question: I see that some pages can be moved around. Would this apply to
the slab-based solution, or can I assume that once I have certain
physical pages sealed, they will not be altered anymore?

>> * While I do not strictly need a new memory zone, memory zones are what
>> kmalloc understands at the moment: AFAIK, it is not possible to tell
>> kmalloc from which memory pool it should fish out the memory, other than
>> having a reference to a memory zone.
> 
> As I've said already, I think that a zone is a completely wrong
> approach. How would it help anyway? It is the allocator on top of the
> page allocator which has to do clever things to support sealing.


Ok, as long as there is a way forward that fits my needs and has the
possibility to be merged upstream, I'm fine with it.

I suppose zones are the first thing one meets when reading the code, so
they are probably the first target that comes to mind.
That's what happened to me.

I will probably come back with further questions, but I can then start
putting together some prototype of what you described.

I am fine with providing a generic solution, but I must make sure that
it works with slub. I suppose what you proposed will do it, right?

TBH, from what little I have read so far, I find it a bit confusing
that there are some header files referring separately to slab, slub and
slob, but then common code still refers to slab (slab.h, slab.c and
slab_common.c, for example).


[1] https://marc.info/?l=linux-kernel&m=149388596106305&w=2


---
thanks, igor


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-04 12:14   ` Igor Stoppa
@ 2017-05-04 13:11     ` Michal Hocko
  2017-05-04 13:37       ` Igor Stoppa
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-05-04 13:11 UTC (permalink / raw)
  To: Igor Stoppa; +Cc: linux-mm, linux-kernel, Dave Hansen

On Thu 04-05-17 15:14:10, Igor Stoppa wrote:
[...]
> I wonder if you are thinking about loadable modules or maybe livepatch.
> My proposal, in its current form, is only about what is done when the
> kernel initialization is performed. So it would not take those cases
> under its umbrella. Actually it might be incompatible with livepatch, if
> any of the read-only data is supposed to be updated.
> 
> Since it's meant to improve the current level of integrity, I would
> prefer to have a progressive approach and address modules/livepatch in a
> later phase, if this is not seen as a show stopper.

I believe that this is a fundamental question. Sealing sounds useful
for after-boot use cases as well, and that would change the approach
considerably. Coming up with an ad-hoc solution for the boot-time-only
case seems like the wrong way to me. And as you've said, SELinux, which
is your target, already does this after early boot.

[...]
> > Roughly it would mean that once kmem_cache_seal() is
> > called on a cache, it would change the page tables of the used slab pages
> > to RO state. This would obviously need some fiddling to make those sealed
> > pages unusable for new allocations. It would also mean some changes to
> > the kfree path, but I guess this is doable.
> 
> Ok, as it probably has already become evident, I have just started
> peeking into the memory subsystem, so this is the sort of guidance I was
> hoping I could receive =) - thank you
> 
> Question: I see that some pages can be moved around. Would this apply to
> the slab-based solution, or can I assume that once I have certain
> physical pages sealed, they will not be altered anymore?

Slab pages are not migratable currently. Even if they start being
migratable it would be an opt-in, because that requires pointer tracking
to make sure they are updated properly.
 
> >> * While I do not strictly need a new memory zone, memory zones are what
> >> kmalloc understands at the moment: AFAIK, it is not possible to tell
> >> kmalloc from which memory pool it should fish out the memory, other than
> >> having a reference to a memory zone.
> > 
> > As I've said already, I think that a zone is a completely wrong
> > approach. How would it help anyway? It is the allocator on top of the
> > page allocator which has to do clever things to support sealing.
> 
> 
> Ok, as long as there is a way forward that fits my needs and has the
> possibility to be merged upstream, I'm fine with it.
> 
> I suppose zones are the first thing one meets when reading the code, so
> they are probably the first target that comes to mind.
> That's what happened to me.
> 
> I will probably come back with further questions, but I can then start
> putting together some prototype of what you described.
> 
> I am fine with providing a generic solution, but I must make sure that
> it works with slub. I suppose what you proposed will do it, right?

I haven't researched that too deeply. In principle both SLAB and SLUB
maintain slab pages in a similar way so I do not see any fundamental
problems.

-- 
Michal Hocko
SUSE Labs


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-04 13:11     ` Michal Hocko
@ 2017-05-04 13:37       ` Igor Stoppa
  2017-05-04 14:01         ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Igor Stoppa @ 2017-05-04 13:37 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, Dave Hansen

On 04/05/17 16:11, Michal Hocko wrote:
> On Thu 04-05-17 15:14:10, Igor Stoppa wrote:

> I believe that this is a fundamental question. Sealing sounds useful
> for after-boot use cases as well, and that would change the approach
> considerably. Coming up with an ad-hoc solution for the boot-time-only
> case seems like the wrong way to me. And as you've said, SELinux, which
> is your target, already does this after early boot.

I didn't spend too much thought on this so far, because the zone-based
approach seemed almost doomed, so I wanted to wait for the evolution of
the discussion :-)

The main question here is granularity, I think.

At least, as a first cut, the simpler approach would be to have a master
toggle: when some legitimate operation needs to happen, the seal is
lifted across the entire range, then it is put back in place once the
operation has concluded. In code it would look like the sketch below.
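
Roughly (names hypothetical):

ro_pool_unseal();		/* the whole range becomes writable again */
update_protected_data();	/* the legitimate modification */
ro_pool_reseal();		/* protection restored */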

Simplicity is the main advantage.

The disadvantage is that anything can happen, undetected, while the seal
is lifted.
OTOH the amount of code that could backfire should be fairly limited, so
it doesn't seem a huge issue to me.

The alternative would be to somehow know what a write will change and
make only the appropriate page(s) writable. But it seems overkill to me,
especially because in some cases, with huge pages, everything would fit
in one page anyway.

One more option that comes to mind - but I do not know how realistic it
would be - is to have multiple slabs, to be used for different purposes.
Ex: one for the monolithic kernel and one for modules.
It wouldn't help for livepatch, though, as it can modify both, so both
would have to be unprotected.

But live-patching is potentially a far less frequent event than module
loading/unloading (thinking about USB gadgets, for example).

[...]

> Slab pages are not migratable currently. Even if they start being
> migratable it would be an opt-in, because that requires pointer tracking
> to make sure they are updated properly.

ok

[...]

> I haven't researched that too deeply. In principle both SLAB and SLUB
> maintain slab pages in a similar way so I do not see any fundamental
> problems.


good, then I could proceed with the prototype, if there are no further
objections/questions and we agree that, implementation aside, there are
no obvious fundamental problems preventing the merge


---
thanks, igor


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-04 13:37       ` Igor Stoppa
@ 2017-05-04 14:01         ` Michal Hocko
  2017-05-04 17:24           ` Dave Hansen
  2017-05-05 12:19           ` Igor Stoppa
  0 siblings, 2 replies; 23+ messages in thread
From: Michal Hocko @ 2017-05-04 14:01 UTC (permalink / raw)
  To: Igor Stoppa; +Cc: linux-mm, linux-kernel, Dave Hansen

On Thu 04-05-17 16:37:55, Igor Stoppa wrote:
> On 04/05/17 16:11, Michal Hocko wrote:
> > On Thu 04-05-17 15:14:10, Igor Stoppa wrote:
> 
> > I believe that this is a fundamental question. Sealing sounds useful
> > for after-boot usecases as well and it would change the approach
> > considerably. Coming up with an ad-hoc solution for the boot only way
> > seems like a wrong way to me. And as you've said SELinux which is your
> > target already does the thing after the early boot.
> 
> I didn't spend too much thought on this so far, because the zone-based
> approach seemed almost doomed, so I wanted to wait for the evolution of
> the discussion :-)
> 
> The main question here is granularity, I think.
> 
> At least, as a first cut, the simpler approach would be to have a master
> toggle: when some legitimate operation needs to happen, the seal is
> lifted across the entire range, then it is put back in place once the
> operation has concluded.
> 
> Simplicity is the main advantage.
> 
> The disadvantage is that anything can happen, undetected, while the seal
> is lifted.

Yes and I think this makes it basically pointless

> OTOH the amount of code that could backfire should be fairly limited, so
> it doesn't seem a huge issue to me.
> 
> The alternative would be to somehow know what a write will change and
> make only the appropriate page(s) writable. But it seems overkill to me,
> especially because in some cases, with huge pages, everything would fit
> in one page anyway.
> 
> One more option that comes to mind - but I do not know how realistic it
> would be - is to have multiple slabs, to be used for different purposes.
> Ex: one for the monolithic kernel and one for modules.
> It wouldn't help for livepatch, though, as it can modify both, so both
> would have to be unprotected.

Just to make my proposal more clear. I suggest the following workflow

cache = kmem_cache_create(foo, object_size, ..., SLAB_SEAL);

obj = kmem_cache_alloc(cache, gfp_mask);
init_obj(obj)
[more allocations]
kmem_cache_seal(cache);

All slab pages belonging to the cache would get write protection. All
new allocations from this cache would go to new slab pages. Later
kmem_cache_seal will write protect only those new pages.
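
Internally, sealing could boil down to walking the slab pages the cache
has used so far and write-protecting them, along these lines (pure
sketch: the used_slabs list and the SLAB_SEALED flag are hypothetical,
and the real per-node/per-cpu bookkeeping is more involved):

void kmem_cache_seal(struct kmem_cache *s)
{
        struct page *page;

        list_for_each_entry(page, &s->used_slabs, lru)
                set_memory_ro((unsigned long)page_address(page),
                              1 << compound_order(page));
        s->flags |= SLAB_SEALED;	/* new allocations must use fresh pages */
}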

The main discomfort with this approach is that you have to create those
caches in advance, obviously. We could help by creating some general
purpose caches for common sizes, but this sounds like overkill to me.
The caller will know which objects will need the protection, so the
appropriate cache can be created on demand. But this really depends on
potential users...
-- 
Michal Hocko
SUSE Labs


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-04  8:17   ` Igor Stoppa
@ 2017-05-04 14:30     ` Dave Hansen
  2017-05-05  8:53       ` Igor Stoppa
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2017-05-04 14:30 UTC (permalink / raw)
  To: Igor Stoppa, Michal Hocko; +Cc: linux-mm, linux-kernel

On 05/04/2017 01:17 AM, Igor Stoppa wrote:
> Or, let me put it differently: my goal is to not fracture more pages
> than needed.
> It will probably require some profiling to figure out what is the
> ballpark of the memory footprint.

This is easy to say, but hard to do.  What if someone loads a different
set of LSMs, or uses a very different configuration?  How could this
possibly work generally without vastly over-reserving in most cases?

> I might have overlooked some aspect of this, but the overall goal
> is to have a memory range (I won't call it zone, to avoid referring to a
> specific implementation) which is as tightly packed as possible, stuffed
> with all the data that is expected to become read-only.

I'm starting with the assumption that a new zone isn't feasible. :)


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-03 12:06 RFC v2: post-init-read-only protection for data allocated dynamically Igor Stoppa
       [not found] ` <70a9d4db-f374-de45-413b-65b74c59edcb@intel.com>
  2017-05-04 11:21 ` Michal Hocko
@ 2017-05-04 16:49 ` Laura Abbott
  2017-05-05 10:42   ` Igor Stoppa
  2 siblings, 1 reply; 23+ messages in thread
From: Laura Abbott @ 2017-05-04 16:49 UTC (permalink / raw)
  To: Igor Stoppa, Michal Hocko; +Cc: linux-mm, linux-kernel, kernel-hardening

[adding kernel-hardening since I think there would be interest]

On 05/03/2017 05:06 AM, Igor Stoppa wrote:
[...]

> * The mechanism used for locking down the memory region is to program
> the MMU to trap writes to said region. It is fairly efficient and
> HW-backed, so it doesn't introduce any major overhead, but the MMU deals
> only with pages or supersets of pages, hence the need to collect all the
> soon-to-be-read-only data - and only that - into the "special region".
> The "__read_only" modifier is the admission ticket.

[...]

BPF takes the approach of calling set_memory_ro to mark regions as
read only. I'm certainly oversimplifying, but it sounds like this is
mainly a mechanism to have this happen mostly automatically.
Can you provide any more details about the tradeoffs of the two approaches?
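
For reference, the pattern is roughly the following (set_memory_ro()
is the real interface; the function wrapping it is illustrative):

static void *alloc_ro_region(size_t len)
{
        void *buf = vmalloc(PAGE_ALIGN(len));	/* page-aligned, page-granular */

        if (!buf)
                return NULL;
        /* ... fill in the payload during init ... */
        set_memory_ro((unsigned long)buf, PAGE_ALIGN(len) >> PAGE_SHIFT);
        return buf;
}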

arm and arm64 have the added complexity of using larger
page sizes on the linear map so dynamic mapping/unmapping generally
doesn't work. arm64 supports DEBUG_PAGEALLOC by mapping with only
pages but this is generally only wanted as a debug mechanism.
I don't know if you've given this any thought at all.

Thanks,
Laura


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-04 14:01         ` Michal Hocko
@ 2017-05-04 17:24           ` Dave Hansen
  2017-05-05 12:08             ` Igor Stoppa
  2017-05-05 12:19           ` Igor Stoppa
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2017-05-04 17:24 UTC (permalink / raw)
  To: Michal Hocko, Igor Stoppa; +Cc: linux-mm, linux-kernel

On 05/04/2017 07:01 AM, Michal Hocko wrote:
> Just to make my proposal more clear. I suggest the following workflow
> 
> cache = kmem_cache_create(foo, object_size, ..., SLAB_SEAL);
> 
> obj = kmem_cache_alloc(cache, gfp_mask);
> init_obj(obj)
> [more allocations]
> kmem_cache_seal(cache);
> 
> All slab pages belonging to the cache would get write protection. All
> new allocations from this cache would go to new slab pages. Later
> kmem_cache_seal will write protect only those new pages.

Igor, what sizes of objects are you after here, mostly?

I ask because slub, at least, doesn't work at all for objects
>PAGE_SIZE.  It just punts those to the page allocator.  But, you
_could_ still use vmalloc() for those.


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-04 14:30     ` Dave Hansen
@ 2017-05-05  8:53       ` Igor Stoppa
  0 siblings, 0 replies; 23+ messages in thread
From: Igor Stoppa @ 2017-05-05  8:53 UTC (permalink / raw)
  To: Dave Hansen, Michal Hocko; +Cc: linux-mm, linux-kernel

On 04/05/17 17:30, Dave Hansen wrote:
> On 05/04/2017 01:17 AM, Igor Stoppa wrote:
>> Or, let me put it differently: my goal is to not fracture more pages
>> than needed.
>> It will probably require some profiling to figure out what is the
>> ballpark of the memory footprint.
> 
> This is easy to say, but hard to do.  What if someone loads a different
> set of LSMs, or uses a very different configuration?  How could this
> possibly work generally without vastly over-reserving in most cases?

I am probably making some implicit assumptions.
Let me try to make them explicit and see if they survive public
scrutiny; btw, while writing them down, I see that they probably won't :-S

Observations
------------

* The memory that might need sealing is less than or equal to the total
  r/w memory - whatever that might be.

* In practice only a subset of the r/w memory will qualify for sealing.

* The over-reserving might be abysmal, in terms of percentage of
  actually used memory, but it might not affect too much the system, in
  absolute terms.

* On my machine (Ubuntu 16.10 64bit):

  ~$ dmesg |grep Memory
  [    0.000000] Memory: 32662956K/33474640K available (8848K kernel
  code, 1441K rwdata, 3828K rodata, 1552K init, 1296K bss, 811684K
  reserved, 0K cma-reserved)

* This is the memory at boot; I am not sure what would be the right way
to get the same info at runtime.


Speculations
------------

* after loading enough modules, the rwdata is 2-3 times larger

* the amount of rwdata that can be converted to rodata is 50%;
  this is purely a working assumption I am making, as I have no
  measurements yet, and it needs to be revised.

* on a system like mine, it would mean 2-3 MB


Conclusions
-----------

* 2-3 MB with possibly 50% utilization might be an acceptable
compromise for a distro - as a user I probably wouldn't mind too much.

* if someone is not happy with the distro defaults, every major distro
provides means to reconfigure and rebuild its kernel (the expectation is
that the only distro users who are not happy are those who would
probably reconfigure the kernel anyway, like a data center)

* non-distro users, like mobile, embedded, IoT, etc. would do
optimizations and tweaking even without this feature mandating it.

--

In my defense, I can only say that my idea for this feature was to make
it opt-in, where, if one chooses to enable it, it is known upfront
what it will entail.
Now we are talking about distros, with the feature enabled by default.

TBH I am not sure there even is a truly generic solution, because we are
talking about dynamically allocated data, where the amount is not known
upfront (if it was, probably the data would be static).


I have the impression that it's a situation like:
- efficient memory occupation
- no need for profiling
- non fragmented pages

Choose 2 of them.


Of course, there might be a better way, but I haven't found it yet,
other than the usual way out: make it a command line option and let the
unhappy user modify the command line that the bootloader passes to the
kernel.

[...]

> I'm starting with the assumption that a new zone isn't feasible. :)

I really have no bias: I have a problem and I am trying to solve it.
I think the improvement could be useful also for others.

If the problem can be solved in a better way than what I proposed, it is
still good for me.

---
igor


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-04 16:49 ` Laura Abbott
@ 2017-05-05 10:42   ` Igor Stoppa
  2017-05-08 15:25     ` Laura Abbott
  2017-05-10  8:05     ` Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: Igor Stoppa @ 2017-05-05 10:42 UTC (permalink / raw)
  To: Laura Abbott, Michal Hocko; +Cc: linux-mm, linux-kernel, kernel-hardening

On 04/05/17 19:49, Laura Abbott wrote:
> [adding kernel-hardening since I think there would be interest]

thank you, I overlooked this


> BPF takes the approach of calling set_memory_ro to mark regions as
> read only. I'm certainly oversimplifying, but it sounds like this is
> mainly a mechanism to have this happen mostly automatically.
> Can you provide any more details about the tradeoffs of the two approaches?

I am not sure I understand the question ...
From what I can understand, BPF is marking as read only something
that spans various pages, which is fine.
The payload to be protected is already organized in such pages.

But in the case I have in mind, I have various, heterogeneous chunks of
data, coming from various subsystems, not necessarily page aligned.
And, even if they were page aligned, most likely they would be far
smaller than a page, even a 4k page.

The first problem I see is how to compact them into pages, ensuring
that no rwdata manages to infiltrate the range.

The actual mechanism for marking pages as read only is not relevant at
this point, if I understand your question correctly, since set_memory_ro
walks the pages it receives as a parameter.

> arm and arm64 have the added complexity of using larger
> page sizes on the linear map so dynamic mapping/unmapping generally
> doesn't work. 

Do you mean that a page could be 16MB and therefore it would not be
possible to get a smaller chunk?

> arm64 supports DEBUG_PAGEALLOC by mapping with only
> pages but this is generally only wanted as a debug mechanism.
> I don't know if you've given this any thought at all.

Since the beginning I have thought about this as an opt-in
feature. I am aware that it can have drawbacks, but I think it would be
valuable as a debugging tool even where it's not feasible to keep it
always-on.

OTOH on certain systems it can be sufficiently appealing to be kept on,
even if it eats up some more memory.

If this doesn't answer your question, could you please detail it more?

---
thanks, igor


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-04 17:24           ` Dave Hansen
@ 2017-05-05 12:08             ` Igor Stoppa
  0 siblings, 0 replies; 23+ messages in thread
From: Igor Stoppa @ 2017-05-05 12:08 UTC (permalink / raw)
  To: Dave Hansen, Michal Hocko; +Cc: linux-mm, linux-kernel

On 04/05/17 20:24, Dave Hansen wrote:
> On 05/04/2017 07:01 AM, Michal Hocko wrote:
>> Just to make my proposal more clear. I suggest the following workflow
>>
>> cache = kmem_cache_create(foo, object_size, ..., SLAB_SEAL);
>>
>> obj = kmem_cache_alloc(cache, gfp_mask);
>> init_obj(obj)
>> [more allocations]
>> kmem_cache_seal(cache);
>>
>> All slab pages belonging to the cache would get write protection. All
>> new allocations from this cache would go to new slab pages. Later
>> kmem_cache_seal will write protect only those new pages.
> 
> Igor, what sizes of objects are you after here, mostly?

Theoretically, anything, since I have not really looked in detail into
all the various subsystems. However, taking a more pragmatic approach
and referring to SE Linux and the LSM hooks, which were my initial targets:

For SE Linux, I'm taking as an example the policy db [1]:
The sizes are mostly small-ish: from 4-6 bytes to 16-32, overall.
There are some exceptions: the main policydb structure is way larger,
but it's not supposed to be instantiated repeatedly.


For the LSM hooks, the sublists in that hydra which goes under the name
of struct security_hook_heads are of type struct security_hook_list, so
a handful of bytes for the generic element [2] (see below).
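
For reference, the element type from [2] is currently just:

struct security_hook_list {
        struct list_head                list;
        struct list_head                *head;
        union security_list_options     hook;
};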


> I ask because slub, at least, doesn't work at all for objects
> >PAGE_SIZE.  It just punts those to the page allocator.  But, you
> _could_ still use vmalloc() for those.


I would be surprised to find many objects that are larger than PAGE_SIZE
and qualify for post-init-read-only protection, even if the page size
was only 4kB.

From that perspective, I'm more concerned about avoiding taking a lot of
pages and leaving them mostly unused.

[1] security/selinux/ss/policydb.h
[2] include/linux/lsm_hooks.h


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-04 14:01         ` Michal Hocko
  2017-05-04 17:24           ` Dave Hansen
@ 2017-05-05 12:19           ` Igor Stoppa
  2017-05-10  7:45             ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: Igor Stoppa @ 2017-05-05 12:19 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, Dave Hansen



On 04/05/17 17:01, Michal Hocko wrote:
> On Thu 04-05-17 16:37:55, Igor Stoppa wrote:

[...]

>> The disadvantage is that anything can happen, undetected, while the seal
>> is lifted.
> 
> Yes and I think this makes it basically pointless

ok, this goes a bit beyond what I had in mind initially, but I see your
point

[...]

> Just to make my proposal more clear. I suggest the following workflow
> 
> cache = kmem_cache_create(foo, object_size, ..., SLAB_SEAL);
>
> obj = kmem_cache_alloc(cache, gfp_mask);
> init_obj(obj)
> [more allocations]
> kmem_cache_seal(cache);

In case one doesn't want the feature, at which point would it be disabled?

* not creating the slab
* not sealing it
* something else?

> All slab pages belonging to the cache would get write protection. All
> new allocations from this cache would go to new slab pages. Later
> kmem_cache_seal will write protect only those new pages.

ok

> The main discomfort with this approach is that you have to create those
> caches in advance, obviously. We could help by creating some general
> purpose caches for common sizes, but this sounds like overkill to me.
> The caller will know which objects will need the protection, so the
> appropriate cache can be created on demand. But this really depends on
> potential users...

Yes, I provided a more detailed answer in another branch of this thread.
Right now I can answer only for what I have already looked into: SE
Linux policy DB and LSM Hooks, and they do not seem very large.

I do not expect a large footprint overall, although there might be some
exceptions.


--
igor


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-05 10:42   ` Igor Stoppa
@ 2017-05-08 15:25     ` Laura Abbott
  2017-05-09  9:38       ` Igor Stoppa
  2017-05-10  8:05     ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: Laura Abbott @ 2017-05-08 15:25 UTC (permalink / raw)
  To: Igor Stoppa, Michal Hocko; +Cc: linux-mm, linux-kernel, kernel-hardening

On 05/05/2017 03:42 AM, Igor Stoppa wrote:
[...]

> The actual mechanism for marking pages as read only is not relevant at
> this point, if I understand your question correctly, since set_memory_ro
> walks the pages it receives as a parameter.
> 

Thanks for clarifying, this makes sense. I also saw some replies
up-thread that answered some of my questions.

>> arm and arm64 have the added complexity of using larger
>> page sizes on the linear map so dynamic mapping/unmapping generally
>> doesn't work. 
> 
> Do you mean that a page could be 16MB and therefore it would not be
> possible to get a smaller chunk?
> 

Roughly yes.

PAGE_SIZE is still 4K/16K/64K but the underlying page table mappings
may use larger mappings (2MB, 32M, 512M, etc.). The ARM architecture
has a break-before-make requirement which requires old mappings be
fully torn down and invalidated to avoid TLB conflicts. This is nearly
impossible to do correctly on live page tables so the current policy
is to not break down larger mappings.

>> arm64 supports DEBUG_PAGEALLOC by mapping with only
>> pages but this is generally only wanted as a debug mechanism.
>> I don't know if you've given this any thought at all.
> 
> Since the beginning I have thought about this as an opt-in
> feature. I am aware that it can have drawbacks, but I think it would be
> valuable as a debugging tool even where it's not feasible to keep it
> always-on.
> 
> OTOH on certain systems it can be sufficiently appealing to be kept on,
> even if it eats up some more memory.

I'd rather see this designed as being mandatory from the start and then
provide a mechanism to turn it off if necessary. The uptake and
coverage from opt-in features tends to be very low based on past experience.

Thanks,
Laura


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-08 15:25     ` Laura Abbott
@ 2017-05-09  9:38       ` Igor Stoppa
  0 siblings, 0 replies; 23+ messages in thread
From: Igor Stoppa @ 2017-05-09  9:38 UTC (permalink / raw)
  To: Laura Abbott, Michal Hocko; +Cc: linux-mm, linux-kernel, kernel-hardening

On 08/05/17 18:25, Laura Abbott wrote:
> On 05/05/2017 03:42 AM, Igor Stoppa wrote:
>> On 04/05/17 19:49, Laura Abbott wrote:

[...]

> PAGE_SIZE is still 4K/16K/64K but the underlying page table mappings
> may use larger mappings (2MB, 32M, 512M, etc.). The ARM architecture
> has a break-before-make requirement which requires old mappings be
> fully torn down and invalidated to avoid TLB conflicts. This is nearly
> impossible to do correctly on live page tables so the current policy
> is to not break down larger mappings.

ok, but if a system integrator chooses to have the mapping set to 512M
(let's consider a case that is definitely unoptimized), this granularity
will apply to anything that needs to be marked as R/O and is allocated
through the linear mapping.

If the issue inherently sits with linear allocation, then maybe the
correct approach is to confirm whether linear allocation is really
needed in the first place.
Maybe not, at least for the type of data I have in mind.

However, that would require changes in the users of the interface,
rather than the interface itself.

I don't see this change as a problem, in general, but OTOH I do not know
yet if there are legitimate reasons to use kmalloc for any
post-init-read-only data.

Of course, if you have any proposal that would solve this problem with
large pages, I'd be interested to know.

Also, for me to better understand the magnitude of the problem, do you
know if there is any specific scenario where larger mappings would be
used/recommended?

The major reason for using larger mappings that I can think of is to
improve the efficiency of the TLB when under pressure; however, I
wouldn't be able to judge how much this can affect the overall
performance on big iron or in a small device.

Do you know if there is any similar case that has to deal with locking
down large pages?
Doesn't the kernel text potentially have a similar risk of leaving a
large amount of memory unused?
Is rodata only virtual?

> I'd rather see this designed as being mandatory from the start and then
> provide a mechanism to turn it off if necessary. The uptake and
> coverage from opt-in features tends to be very low based on past experience.

I have nothing against such a wish, actually I'd love it, but I'm not
sure I have the answer for it.

Yet, a partial solution is better than nothing, if there is no truly
flexible one.

--
igor


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-05 12:19           ` Igor Stoppa
@ 2017-05-10  7:45             ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2017-05-10  7:45 UTC (permalink / raw)
  To: Igor Stoppa; +Cc: linux-mm, linux-kernel, Dave Hansen

On Fri 05-05-17 15:19:19, Igor Stoppa wrote:
> 
> 
> On 04/05/17 17:01, Michal Hocko wrote:
> > On Thu 04-05-17 16:37:55, Igor Stoppa wrote:
> 
> [...]
> 
> >> The disadvantage is that anything can happen, undetected, while the seal
> >> is lifted.
> > 
> > Yes and I think this makes it basically pointless
> 
> ok, this goes a bit beyond what I had in mind initially, but I see your
> point
> 
> [...]
> 
> > Just to make my proposal more clear. I suggest the following workflow
> > 
> > cache = kmem_cache_create(foo, object_size, ..., SLAB_SEAL);
> >
> > obj = kmem_cache_alloc(cache, gfp_mask);
> > init_obj(obj)
> > [more allocations]
> > kmem_cache_seal(cache);
> 
> In case one doesn't want the feature, at which point would it be disabled?
> 
> * not creating the slab
> * not sealing it
> * something else?

If the sealing feature were disabled, then sealing would be a no-op.

-- 
Michal Hocko
SUSE Labs


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-05 10:42   ` Igor Stoppa
  2017-05-08 15:25     ` Laura Abbott
@ 2017-05-10  8:05     ` Michal Hocko
  2017-05-10  8:57       ` Igor Stoppa
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-05-10  8:05 UTC (permalink / raw)
  To: Igor Stoppa; +Cc: Laura Abbott, linux-mm, linux-kernel, kernel-hardening

On Fri 05-05-17 13:42:27, Igor Stoppa wrote:
> On 04/05/17 19:49, Laura Abbott wrote:
> > [adding kernel-hardening since I think there would be interest]
> 
> thank you, I overlooked this
> 
> 
> > BPF takes the approach of calling set_memory_ro to mark regions as
> > read only. I'm certainly oversimplifying, but it sounds like this
> > is mostly a mechanism to have this happen automatically.
> > Can you provide any more details about the tradeoffs of the two approaches?
> 
> I am not sure I understand the question ...
> From what I can understand, BPF is marking as read-only something
> that spans various pages, which is fine.
> The payload to be protected is already organized into such pages.
> 
> But in the case I have in mind, I have heterogeneous chunks of
> data, coming from various subsystems, not necessarily page-aligned.
> And, even if they were page-aligned, most likely they would be far
> smaller than a page, even a 4k page.

This aspect of various sizes makes the SLAB allocator not optimal
because it operates on caches (pools of pages) which manage objects of
the same size. You could use the maximum size of all objects and waste
some memory, but you would have to know this max in advance, which would
make this approach less practical. You could create more caches, of
course, but that still requires knowing those sizes in advance.

So it smells like a dedicated allocator which operates on a pool of
pages might be a better option in the end. This depends on what you
expect from the allocator. NUMA awareness? Very effective hotpath? Very
good fragmentation avoidance? CPU cache awareness? Special alignment
requirements? Reasonable free()? Etc...

To me it seems that, this being mostly an initialization thingy, a simple
allocator which manages a pool of pages (one set sealed and one for
allocations) and which only appends new objects as they fit to unsealed
pages would be sufficient for a start.
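
To make "simple" concrete, a minimal sketch of such an append-only
allocator (all names invented and untested; sealing assumed to be done
via set_memory_ro(), with oversized requests and freeing left out):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>		/* ARCH_KMALLOC_MINALIGN */
#include <asm/set_memory.h>	/* set_memory_ro(); header varies by arch/version */

struct seal_pool {
	struct page *open;	/* page currently accepting allocations */
	size_t offset;		/* first free byte within the open page */
};

static void *seal_alloc(struct seal_pool *pool, size_t size)
{
	size_t off = ALIGN(pool->offset, ARCH_KMALLOC_MINALIGN);

	if (!pool->open || off + size > PAGE_SIZE) {
		/* the open page is full: seal it, then start a new one */
		if (pool->open)
			set_memory_ro((unsigned long)page_address(pool->open), 1);
		pool->open = alloc_page(GFP_KERNEL);
		if (!pool->open)
			return NULL;
		off = 0;
	}
	pool->offset = off + size;
	return page_address(pool->open) + off;
}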
-- 
Michal Hocko
SUSE Labs


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-10  8:05     ` Michal Hocko
@ 2017-05-10  8:57       ` Igor Stoppa
  2017-05-10 11:43         ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Igor Stoppa @ 2017-05-10  8:57 UTC (permalink / raw)
  To: Michal Hocko, Dave Hansen
  Cc: Laura Abbott, linux-mm, linux-kernel, kernel-hardening

On 10/05/17 11:05, Michal Hocko wrote:
> On Fri 05-05-17 13:42:27, Igor Stoppa wrote:

[...]

>> ... in the case I have in mind, I have heterogeneous chunks of
>> data, coming from various subsystems, not necessarily page-aligned.
>> And, even if they were page-aligned, most likely they would be far
>> smaller than a page, even a 4k page.
> 
> This aspect of various sizes makes the SLAB allocator not optimal
> because it operates on caches (pools of pages) which manage objects of
> the same size.

Indeed, that's why I wasn't too excited about caches, but probably I was
not able to explain sufficiently well why in the RFC :-/

> You could use the maximum size of all objects and waste
> some memory, but you would have to know this max in advance, which would
> make this approach less practical. You could create more caches, of
> course, but that still requires knowing those sizes in advance.

Yes, and even generic per-architecture or per-board profiling
wouldn't necessarily do much good: taking SE Linux as an example, one
could have two identical boards with almost identical binaries
installed, differing only in the rules/policy.
That difference alone could easily lead to very different size
requirements for the sealable page pool.

> So it smells like a dedicated allocator which operates on a pool of
> pages might be a better option in the end.

ok

> This depends on what you
> expect from the allocator. NUMA awareness? Very effective hotpath? Very
> good fragmentation avoidance? CPU cache awareness? Special alignment
> requirements? Reasonable free()? Etc...

From the perspective of selling the feature to as many subsystems as
possible, I'd say that, as a primary requirement, it shouldn't affect
runtime performance.
I suppose (but it's just my guess) that trading a milliseconds-scale
boot-time slowdown for additional integrity is acceptable in the vast
majority of cases.

The only alignment requirements that I can think of come from the MMU
being able to deal only with physical pages, when it comes to
write-protecting them.

> To me it seems that, this being mostly an initialization thingy, a simple
> allocator which manages a pool of pages (one set sealed and one for
> allocations) 

Shouldn't the set of pages used for keeping track of the others also
be sealed? Once one is RO, the other should not change either.

> and which only appends new objects as they fit to unsealed
> pages would be sufficient for a start.

Any "free" that might happen during the initialization transient would
actually result in an untracked gap, right?

What about the size of the pool of pages?
No predefined size; instead, request a new page when the memory
remaining in the page currently in use is not enough to fit the latest
allocation request?

There are also two aspects we discussed earlier:

- livepatch: how to deal with it? Identify the page it wants to modify
and temporarily un-protect it?

- modules: unloading and reloading modules will eventually lead to
permanently lost pages, in increasing number.
Repeatedly loading/unloading the same module is probably not so common,
with a major exception being USB, where almost anything can show up.
And disappear.
This seems like a major showstopper for the linear allocator you propose.

My reasoning in pursuing the kmalloc approach was that it is already
equipped with mechanisms for dealing with this sort of case, where
memory can be fragmented.
I also wouldn't risk introducing bugs with my homebrew allocator ...

The initial thought was that there could be a master toggle to
seal/unseal all the memory affected.

But you were not too excited, iirc :-D
Alternatively, kmalloc could be enhanced to unseal only the pages it
wants to modify.

I don't think much can be done for data that is placed in the same
page as something that needs to be altered.
But what is outside of that page could still enjoy the protection of
the seal.
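
As an illustration of unsealing only the page to be modified, a
minimal sketch (the helper name is invented; it assumes the object
does not cross a page boundary, and nothing prevents a concurrent
write while the window is open):

static void sealed_page_write(void *dst, const void *src, size_t len)
{
	unsigned long page = (unsigned long)dst & PAGE_MASK;

	set_memory_rw(page, 1);		/* open the write window */
	memcpy(dst, src, len);
	set_memory_ro(page, 1);		/* restore the seal */
}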

--
igor


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-10  8:57       ` Igor Stoppa
@ 2017-05-10 11:43         ` Michal Hocko
  2017-05-10 15:19           ` Igor Stoppa
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-05-10 11:43 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Dave Hansen, Laura Abbott, linux-mm, linux-kernel, kernel-hardening

On Wed 10-05-17 11:57:42, Igor Stoppa wrote:
> On 10/05/17 11:05, Michal Hocko wrote:
[...]
> > To me it seems that, this being mostly an initialization thingy, a simple
> > allocator which manages a pool of pages (one set sealed and one for
> > allocations) 
> 
> Shouldn't the set of pages used for keeping track of the others also
> be sealed? Once one is RO, the other should not change either.

Heh, that really depends on how much consistency and robustness you
want to achieve. It is really hard to defend against targeted attacks
on the allocator metadata when code is running in the kernel.

> > and which only appends new objects as they fit to unsealed
> > pages would be sufficient for a start.
> 
> Any "free" that might happen during the initialization transient would
> actually result in an untracked gap, right?

yes. And once the whole page is free it would get unsealed and returned
to the (page) allocator. This approach would inevitably lead to internal
fragmentation, but reducing that would require a pool shared by objects
with a common life cycle, which is quite hard with the requirements you
have (you would have to convey the allocation context to all users
somehow).

> What about the size of the pool of pages?

I wouldn't see that as a big deal. New pages would be allocated as
needed.

> No predefined size; instead, request a new page when the memory
> remaining in the page currently in use is not enough to fit the latest
> allocation request?

exactly

> There are also two aspects we discussed earlier:
> 
> - livepatch: how to deal with it? Identify the page it wants to modify
> and temporarily un-protect it?

Livepatch doesn't currently support patching data structures, and even
if it did, it would have to understand those data structures and do
something like copy&replace...
 
> - modules: unloading and reloading modules will eventually lead to
> permanently lost pages, in increasing number.

Each module should free all objects that were allocated on its behalf,
and that should result in pages being freed as well.

> Repeatedly loading/unloading the same module is probably not so common,
> with a major exception being USB, where almost anything can show up.
> And disappear.
> This seems like a major showstopper for the linear allocator you propose.

I am not sure I understand. If such a module left allocations behind,
it would be a memory leak no matter what.

> My reasoning in pursuing the kmalloc approach was that it is already
> equipped with mechanisms for dealing with these sort of cases, where
> memory can be fragmented.

Yeah, but kmalloc is optimized for a completely different use case. You
can reuse the same pages again and again, while you clearly cannot do
the same once you seal a page and make it read-only. Well, unless you
want to open time windows during which the page stops being RO, or use
a different mapping for the allocator.

But try to consider how many features of the slab allocator you are
actually going to need wrt. the tweaks it would have to implement to
support this new use case. Maybe duplicating the general purpose caches
and explicitly creating specialized ones is a viable path. I haven't
tried it.

> I also wouldn't risk introducing bugs with my homebrew allocator ...
> 
> The initial thought was that there could be a master toggle to
> seal/unseal all the memory affected.
> 
> But you were not too excited, iirc :-D

Yes, if there are different users, a pool (kmem_cache-like) would be
more natural.

> Alternatively, kmalloc could be enhanced to unseal only the pages it
> wants to modify.

You would have to stop the world to prevent an accidental overwrite
during that time, which makes the whole thing quite dubious IMHO.

> I don't think much can be done for data that is placed in the same
> page as something that needs to be altered.
> But what is outside of that page could still enjoy the protection of
> the seal.

-- 
Michal Hocko
SUSE Labs


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-10 11:43         ` Michal Hocko
@ 2017-05-10 15:19           ` Igor Stoppa
  2017-05-10 15:45             ` Dave Hansen
  0 siblings, 1 reply; 23+ messages in thread
From: Igor Stoppa @ 2017-05-10 15:19 UTC (permalink / raw)
  To: Michal Hocko, Dave Hansen, Laura Abbott
  Cc: linux-mm, linux-kernel, kernel-hardening

On 10/05/17 14:43, Michal Hocko wrote:
> On Wed 10-05-17 11:57:42, Igor Stoppa wrote:
>> On 10/05/17 11:05, Michal Hocko wrote:
> [...]
>>> To me it seems that, this being mostly an initialization thingy, a simple
>>> allocator which manages a pool of pages (one set sealed and one for
>>> allocations) 
>>
>> Shouldn't the set of pages used for keeping track of the others also
>> be sealed? Once one is RO, the other should not change either.
> 
> Heh, that really depends on how much consistency and robustness you
> want to achieve. It is really hard to defend against targeted attacks
> on the allocator metadata when code is running in the kernel.

If one takes the trouble to implement the sealing, then anything that
doesn't have a justification for staying R/W is fair game for sealing,
IMHO.

>>> and which only appends new objects as they fit to unsealed
>>> pages would be sufficient for a start.
>>
>> Any "free" that might happen during the initialization transient would
>> actually result in an untracked gap, right?
> 
> yes. And once the whole page is free it would get unsealed and returned
> to the (page) allocator.

Which means that there must be some way to track the freeing.
I intentionally omitted it, because I wasn't sure it would still be
compatible with the idea of a simple linear allocator.

> This approach would inevitably lead to internal
> fragmentation, but reducing that would require a pool shared by
> objects with a common life cycle, which is quite hard with the
> requirements you have (you would have to convey the allocation
> context to all users somehow).

What if the users were unaware of most of the context and would only use
some flag, say GFP_SEAL?
Shouldn't the allocator be the only one aware of the context?
Context being the actual set of pages used.

Another idea: for each logical group of objects having the same
lifecycle, define a pool, then do linear allocation within the pool for
the respective logical group.

Still some way would be needed to track the utilization of each page,
but it would ensure that when a logical group is discarded, all its
related pages are freed.

>> What about the size of the pool of pages?
> 
> I wouldn't see that as a big deal. New pages would be allocated as
> needed.

ok

[...]

>> - modules: unloading and reloading modules will eventually lead to
>> permanently lost pages, in increasing number.
> 
> Each module should free all objects that were allocated on its behalf,
> and that should result in pages being freed as well.

Only if the objects are forced to be contiguous and to start at the
beginning of a page, which seems to go in the direction of having a
memory pool for each module.

>> Repeatedly loading/unloading the same module is probably not so common,
>> with a major exception being USB, where almost anything can show up.
>> And disappear.
>> This seems like a major showstopper for the linear allocator you propose.
> 
> I am not sure I understand. If such a module left allocations behind,
> it would be a memory leak no matter what.

What I had in mind is that, with a global linear allocator _without_
support for returning "freed" pages, memory consumption would
progressively increase.

But even if the module correctly frees its allocations and they are
correctly tracked, it's still possible that some page doesn't get
returned, unless the module had started placing data at the beginning
of a brand new page and nothing but that module used it.

So it really looks like we are discussing a per-module (linear) allocator.

Probably that's what you meant all along and I just realized it now ...

>> My reasoning in pursuing the kmalloc approach was that it is already
>> equipped with mechanisms for dealing with this sort of case, where
>> memory can be fragmented.
> 
> Yeah, but kmalloc is optimized for a completely different use case. You
> can reuse the same pages again and again, while you clearly cannot do
> the same once you seal a page and make it read-only.

No, but during the allocation transient, I could.

Cons: less protection for what is already in the page.
Pros: tighter packing.

> Well, unless you want to
> open time windows during which the page stops being RO, or use a
> different mapping for the allocator.

Yes, I was proposing to temporarily make the specific page RW.

> But try to consider how many features of the slab allocator you are
> actually going to need wrt. the tweaks it would have to implement to
> support this new use case. Maybe duplicating the general purpose
> caches and explicitly creating specialized ones is a viable path. I
> haven't tried it.
> 
>> I also wouldn't risk introducing bugs with my homebrew allocator ...
>>
>> The initial thought was that there could be a master toggle to
>> seal/unseal all the memory affected.
>>
>> But you were not too excited, iirc :-D
> 
> Yes, if there are different users, a pool (kmem_cache-like) would be
> more natural.
> 
>> Alternatively, kmalloc could be enhanced to unseal only the pages it
>> wants to modify.
> 
> You would have to stop the world to prevent an accidental overwrite
> during that time, which makes the whole thing quite dubious IMHO.
> 
>> I don't think much can be done for data that is placed in the same
>> page as something that needs to be altered.
>> But what is outside of that page could still enjoy the protection of
>> the seal.

Recap
-----
The latest proposal would be (I can create a new version of the RFC if
preferred):

Have a per-module linear memory allocator, requesting new pages on
demand, with some way to track:

* pages used in each pool (ex: a next ptr in each page)
* free space at the end of the page, implemented, for example, with a
  counter; allocations would be aligned according to arch requirements

Pages are sealed as soon as they fill up and the next one is allocated,
or when they are explicitly sealed by their respective module.

In the typical case the freeing would happen on the entire pool, for
example when the module is unloaded.
It might be even nicer to have a master teardown call for the whole
pool; a sketch of that follows below.
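
A possible shape for the pool metadata and the master teardown, purely
as a sketch (all names invented; unsealing assumed to be done with
set_memory_rw()):

/* header placed at the start of each page belonging to a pool */
struct seal_node {
	struct seal_node *next;		/* chain of the pool's pages */
	size_t used;			/* offset of the first free byte */
};

struct seal_pool {
	struct seal_node *pages;	/* most recently opened page first */
};

/* master teardown: unseal and return every page of the pool */
static void seal_pool_destroy(struct seal_pool *pool)
{
	struct seal_node *node = pool->pages;

	while (node) {
		struct seal_node *next = node->next;

		set_memory_rw((unsigned long)node, 1);
		free_page((unsigned long)node);
		node = next;
	}
	pool->pages = NULL;
}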


There is still the problem of how to deal with large physical pages and
kmalloc.
Having a per-module pool of pages is likely to generate even more waste
if the pages are particularly large.

So I'd like to play a little what-if scenario:
what if I were to support exclusively virtual memory and convert to it
everything that might need sealing?

I cannot find any reason why this could not be done, even if the
original code uses kmalloc.

Extension
---------
What I discussed so far is about things that are not expected to change.
At most they would be freed, as units.

However, if some other data exhibits quasi-read-only characteristics,
it could be protected as well.

With the understanding that there would be holes in the memory
allocation and a linear allocator probably would not be enough anymore.

This could be achieved by keeping a bitmap of machine-aligned words.

Ex: a 4k page with 8-byte words holds 512 words, so the bitmap would
need 512 bits, i.e. 64 bytes (a sketch follows below).
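
A minimal sketch of that tracking, using the kernel bitmap helpers
(the names are invented):

#include <linux/bitmap.h>

/* one bit per machine word: on a 4k page with 8-byte words,
 * 512 bits = 64 bytes of metadata per page
 */
#define WORDS_PER_PAGE	(PAGE_SIZE / sizeof(long))

struct seal_word_map {
	DECLARE_BITMAP(used, WORDS_PER_PAGE);
};

/* a page with no word in use could be unsealed and freed */
static bool seal_page_empty(const struct seal_word_map *map)
{
	return bitmap_empty(map->used, WORDS_PER_PAGE);
}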

--
igor


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-10 15:19           ` Igor Stoppa
@ 2017-05-10 15:45             ` Dave Hansen
  2017-05-19 10:51               ` Igor Stoppa
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2017-05-10 15:45 UTC (permalink / raw)
  To: Igor Stoppa, Michal Hocko, Laura Abbott
  Cc: linux-mm, linux-kernel, kernel-hardening

On 05/10/2017 08:19 AM, Igor Stoppa wrote:
> So I'd like to play a little what-if scenario:
> what if I was to support exclusively virtual memory and convert to it
> everything that might need sealing?

Because of the issues related to fracturing large pages, you might have
had to go this route eventually anyway.  Changing the kernel linear map
isn't nice.

FWIW, you could test this scheme by just converting all the users to
vmalloc() and seeing what breaks.  They'd all end up rounding up all
their allocations to PAGE_SIZE, but that'd be fine for testing.
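
For instance, the mechanical conversion for a typical user would be
(just a sketch of the test, not the final form):

	obj = kmalloc(size, GFP_KERNEL);	/* before */
	...
	kfree(obj);

becoming

	obj = vmalloc(size);	/* after: implicitly rounded up to PAGE_SIZE */
	...
	vfree(obj);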

Could you point out 5 or 10 places in the kernel that you want to convert?


* Re: RFC v2: post-init-read-only protection for data allocated dynamically
  2017-05-10 15:45             ` Dave Hansen
@ 2017-05-19 10:51               ` Igor Stoppa
  0 siblings, 0 replies; 23+ messages in thread
From: Igor Stoppa @ 2017-05-19 10:51 UTC (permalink / raw)
  To: Dave Hansen, Michal Hocko, Laura Abbott
  Cc: linux-mm, linux-kernel, kernel-hardening

Hello,

On 10/05/17 18:45, Dave Hansen wrote:
> On 05/10/2017 08:19 AM, Igor Stoppa wrote:
>> So I'd like to play a little what-if scenario:
>> what if I were to support exclusively virtual memory and convert to it
>> everything that might need sealing?
> 
> Because of the issues related to fracturing large pages, you might have
> had to go this route eventually anyway.  Changing the kernel linear map
> isn't nice.
> 
> FWIW, you could test this scheme by just converting all the users to
> vmalloc() and seeing what breaks.  They'd all end up rounding up all
> their allocations to PAGE_SIZE, but that'd be fine for testing.

Apologies for the long hiatus; it took me some time to figure out a
solution that could somehow address all the comments I got up to this
point.

It's here [1]; I preferred to start a new thread, since the proposal
has in practice changed significantly, even if in spirit it's still
the same.

It should also take care of the potential waste of space you mentioned
wrt the round-up to PAGE_SIZE.

> Could you point out 5 or 10 places in the kernel that you want to convert?

Right now I can only repeat what I said initially:
- the linked list used to implement LSM hooks
- SE Linux structures used to implement the policy DB; it should be
  about 5 data types

Next week, I'll address the 2 cases I listed, then I'll look for more,
but I think it should not be difficult to find customers for this.

BTW, I forgot to mention that I tested the code against both SLAB and
SLUB and it seems to work fine.

So far I've used QEMU x86-64 as test environment.

--
igor


[1] https://marc.info/?l=linux-mm&m=149519044015956&w=2

