linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3
@ 2005-08-26 18:23 Blaisorblade
  2005-08-26 19:11 ` Hugh Dickins
  2005-09-02 21:02 ` Hugh Dickins
  0 siblings, 2 replies; 16+ messages in thread
From: Blaisorblade @ 2005-08-26 18:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Hugh Dickins, Andi Kleen, LKML, Jeff Dike,
	Bodo Stroesser, user-mode-linux-devel

[-- Attachment #1: Type: text/plain, Size: 5522 bytes --]

This is a followup to my post of last week (Aug 12) about remap_file_pages 
protection support. I've improved and consolidated the patches and updated 
them against 2.6.13-rc6/rc7 (the same patches apply against both versions).
I'm sending the full patch series only to akpm, mingo and LKML.

I've also reduced them to only 18, and made the splitting more significant. 
I'm not resending all the patches for foreign architectures, because they're 
almost unchanged since last time (there's just a trivial reject from ppc32, 
because one change has already been done after -rc4).

I'm working on this to provide support for UML, which currently easily creates 
more than 64K (the default limit) vma's for a single process. Actually, it 
needs one VMA per each page. So, with this patch and specific UML support, 
which Ingo wrote and which I'm porting to recent UMLs.

Some highlights:

* The first 2 patches modify the PTE encoding macros and start preparing the 
VM for the new situation (i.e. VMA which have variable protections, which are 
called VM_NONUNIFORM. I dropped the early VM_MANYPROTS name).

Patch number 2 will require fixing up all arches like in 2.6.4-rc2-mm1, to 
provide the new PTE encoding macros.

* Patch 5 allows the syscall to actually create such VMAs. Before that, 
there's no difference in behaviour with the current kernel (except that 
there's less space for file offset encoding in PTEs). And even here, the new 
operations are only enabled for arch explicitly supporting it (see patch #7).

* Patch 8 and 9 change the path for handling page faults, since the permission 
checking on nonuniform vmas cannot be done until the PTE entry has been read.

This is the most intrusive part, but
a) archs are not required to adequate to this immediately
b) it isn't so difficult in practice.

* Patch 11 is a big simplification. Since we must encode the PTE's on swapout 
like in VM_NONLINEAR vmas, the simplest way to reuse the existing code is to 
make sure that VM_NONUNIFORM vmas are also marked as VM_NONLINEAR.

It is possible to avoid this, as in patch #18, but it's just a bit scary, and 

Then there are 4 optimization patches and 3 fixups for some odd cases that we 
maybe won't support. They are namely:
*) vmas with default PROT_NONE protection (I actually feel we're going to 
support this, the only patch which has problems is an optimization)

*) MAP_POPULATE on private VMA (no problem on this) and consequently 
remap_file_pages on private VMA to install linear uniform mappings (since 
MAP_POPULATE is implemented in terms of remap_file_pages): there's a patch to 
stop this from truncating COW pages away, but I don't think it's worth it.

*) linear nonuniform vmas. I initially created them because there's no 
relation between being nonlinear and nonuniform, but it later turned out 
supporting them is intrusive.

I have improved even more the patches, and understood better some changes from 
Ingo which I didn't last time, and fixed their bugs.

I hope these changes can be reviewed, and included inside -mm, even if they'll 
conflict with pagefault scalability patches (even if I think the conflicts 
are not difficult to solve).

Still, the patch is IMHO in better shape, in many ways, than when it was in 
-mm last time. To handle properly all possibilities it has become a bit more 
intrusive.

The original one was designed to handle only the simpler needs of 
UML (an mmap'ing with PROT_NONE followed by nonlinear and nonuniform 
remappings), but it still failed in some cases. I've taken original Ingo's 
test-program and significantly extended it, it's attached to this patch.

I'll appreciate any comments.

==============
Changes from 2.6.5-mm1/dropped version of the patches:
==============
*) Actually implemented _real_ and _anal_ protection support, safe against 
swapout; programs get SIGSEGV *always* when they should. I've used the 
attached test program (an improved version of Ingo's one) to check that.
I tested just until patch 25, onto UML. The subsequent ones are either patches 
for foreign archs or proposed

*) Fixed many changes present in the patches.
*) Fixed UML bits
*) Added some headaches for arches ports. I've also included some patches 
which reduce this.

*) No more usage of a new syscall slot: to use the new interface, application 
will use the new MAP_NOINHERIT flag I've added. I've still the patches to use 
the old -mm ABI, if there's any reason they're needed.

*) Fixed a regression wrt using mprotect() against remapped area (see patch 
15)

======
Changes from my last patch-bomb of the patches:
======
*) fixed mprotect VS remap_file_pages(MAP_NOINHERIT) interaction

*) fixed truncation (with madvise_dontneed or truncate()) of nonuniform but 
linear vmas. Either with patch 11, by removing "nonuniform but linear VMAs", 
or with patch 18.

======
Still todo
======
*) ->populate flushes each TLB individually, instead of using mmu_gathers as 
it should; this was suggested even by Ingo when sending the patch, but it 
seems he didn't get the time to finish this. And I'm now wondering how would 
that relate with I/O... at each I/O point we should finish and regather the 
mmu_gather, as in zap_page_range. But here we are reading pages, not the 
reverse!

Seems rewriting the kernel locking is a quite time-consuming task!
-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

[-- Attachment #2: fremap-test-complete.c.bz2 --]
[-- Type: application/x-bzip2, Size: 5377 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3
  2005-08-26 18:23 [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3 Blaisorblade
@ 2005-08-26 19:11 ` Hugh Dickins
  2005-08-26 19:58   ` [uml-devel] " Blaisorblade
  2005-09-02 21:02 ` Hugh Dickins
  1 sibling, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2005-08-26 19:11 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Andrew Morton, Ingo Molnar, Andi Kleen, LKML, Jeff Dike,
	Bodo Stroesser, user-mode-linux-devel

On Fri, 26 Aug 2005, Blaisorblade wrote:

> This is a followup to my post of last week (Aug 12) about remap_file_pages 
> protection support. I've improved and consolidated the patches and updated 
> them against 2.6.13-rc6/rc7 (the same patches apply against both versions).
> I'm sending the full patch series only to akpm, mingo and LKML.
> 
> I've also reduced them to only 18, and made the splitting more significant. 
> I'm not resending all the patches for foreign architectures, because they're 
> almost unchanged since last time (there's just a trivial reject from ppc32, 
> because one change has already been done after -rc4).
> 
> I'm working on this to provide support for UML, which currently easily creates 
> more than 64K (the default limit) vma's for a single process. Actually, it 
> needs one VMA per each page. So, with this patch and specific UML support, 
> which Ingo wrote and which I'm porting to recent UMLs.

I'll try to take a look sometime next week - or, if I wait until
next Friday, can we expect it to have come down to 9 patches ;-?

I should say, my initial reaction is very much like Andi's last week.

sys_remap_file_pages solves a real problem, but it does so by breaking
lots of rules.  For more than a year after it came in, almost every
development we tried in mm would come up against "but then what do we
do about the nonlinear mappings?".

That has settled down now, but I don't look forward to extending it.
On the other hand, UML does deserve better support.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3
  2005-08-26 19:11 ` Hugh Dickins
@ 2005-08-26 19:58   ` Blaisorblade
  0 siblings, 0 replies; 16+ messages in thread
From: Blaisorblade @ 2005-08-26 19:58 UTC (permalink / raw)
  To: user-mode-linux-devel
  Cc: Hugh Dickins, Andrew Morton, Ingo Molnar, Andi Kleen, LKML,
	Jeff Dike, Bodo Stroesser

On Friday 26 August 2005 21:11, Hugh Dickins wrote:
> On Fri, 26 Aug 2005, Blaisorblade wrote:
> > This is a followup to my post of last week (Aug 12) about
> > remap_file_pages protection support. I've improved and consolidated the
> > patches and updated them against 2.6.13-rc6/rc7 (the same patches apply
> > against both versions). I'm sending the full patch series only to akpm,
> > mingo and LKML.
> >
> > I've also reduced them to only 18, and made the splitting more
> > significant. I'm not resending all the patches for foreign architectures,
> > because they're almost unchanged since last time (there's just a trivial
> > reject from ppc32, because one change has already been done after -rc4).
> >
> > I'm working on this to provide support for UML, which currently easily
> > creates more than 64K (the default limit) vma's for a single process.
> > Actually, it needs one VMA per each page. So, with this patch and
> > specific UML support, which Ingo wrote and which I'm porting to recent
> > UMLs.
>
> I'll try to take a look sometime next week - or, if I wait until
> next Friday, can we expect it to have come down to 9 patches ;-?
:-) Don't think so, unless you want just me to join patches together. However, 
there are still some oneliners, so the patchset is not so huge.

Well, diffstat seems to contradict me (on the joined-up patch):

 Documentation/feature-removal-schedule.txt |   12 +
 arch/i386/mm/fault.c                       |   19 ++
 arch/um/kernel/trap_kern.c                 |   52 ++++++-
 arch/x86_64/mm/fault.c                     |    6
 include/asm-i386/mman.h                    |    1
 include/asm-i386/pgtable-2level.h          |   15 +-
 include/asm-i386/pgtable-3level.h          |   11 +
 include/asm-i386/pgtable.h                 |    3
 include/asm-ia64/mman.h                    |    1
 include/asm-ppc/mman.h                     |    1
 include/asm-ppc64/mman.h                   |    1
 include/asm-s390/mman.h                    |    1
 include/asm-um/pgtable-2level.h            |   15 +-
 include/asm-um/pgtable-3level.h            |   21 ++-
 include/asm-um/pgtable.h                   |    3
 include/asm-x86_64/mman.h                  |    1
 include/asm-x86_64/pgtable.h               |   12 +
 include/linux/mm.h                         |   40 ++++--
 include/linux/pagemap.h                    |   32 ++++
 mm/filemap.c                               |   18 ++
 mm/fremap.c                                |  135 +++++++++++++++-----
 mm/madvise.c                               |    2
 mm/memory.c                                |  193 
++++++++++++++++++++++-------
 mm/mmap.c                                  |   14 +-
 mm/mprotect.c                              |    3
 mm/rmap.c                                  |    6
 mm/shmem.c                                 |   13 +

> I should say, my initial reaction is very much like Andi's last week.

> sys_remap_file_pages solves a real problem, but it does so by breaking
> lots of rules.  For more than a year after it came in, almost every
> development we tried in mm would come up against "but then what do we
> do about the nonlinear mappings?".

Nonuniform mappings are much less of a problem. Really. The reason nonlinear 
mappings reached mainline before nonuniform ones (and I don't know if they 
willl ever) is not simplicity, but the miss of uses until now. And also the 
fact that Ingo hadn't the time to finish it.

In fact, I've been playing a lot with the GIT history during this month of 
development, particularly with objrmap, so I've come across those problems 
quite a lot, but what I noticed is that you mostly don't care about the VMA 
to be uniform, just it to be linear or not (because nonlinear VMAs break 
objrmap).

This patch, in comparison, is just:

*) check permissions in the generic VM when faulting in pages, if and only if 
the vma is nonuniform (yes, nontrivial at all).

*) be anally picky to save the PTE protections together with the rest, and do 
it *everywhere*; but if you say "nonuniform implies linear", you lose this 
problem almost completely.

The only exception is that when you remap an address range with PROT_NONE 
(thus effectively unmapping those addressed), you can't clear the PTEs but 
you must use pfn_pte(0, __S000) to fill them in (this is done in the 
optimization-fixup patch #15).

> That has settled down now, but I don't look forward to extending it.
> On the other hand, UML does deserve better support.

> Hugh

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3
  2005-08-26 18:23 [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3 Blaisorblade
  2005-08-26 19:11 ` Hugh Dickins
@ 2005-09-02 21:02 ` Hugh Dickins
  2005-09-04 19:10   ` [uml-devel] " Blaisorblade
  1 sibling, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2005-09-02 21:02 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Andrew Morton, Ingo Molnar, Andi Kleen, LKML, Jeff Dike,
	Bodo Stroesser, user-mode-linux-devel

On Fri, 26 Aug 2005, Blaisorblade wrote:
> 
> * The first 2 patches modify the PTE encoding macros and start preparing the 
> VM for the new situation (i.e. VMA which have variable protections, which are 
> called VM_NONUNIFORM. I dropped the early VM_MANYPROTS name).

What a pity: please revert.  VM_NONUNIFORM sounds impressive, but might
mean all kinds of things, maybe to do with NUMA.  VM_MANYPROTS is good,
it says what it means.

> * Patch 11 is a big simplification. Since we must encode the PTE's on swapout 
> like in VM_NONLINEAR vmas, the simplest way to reuse the existing code is to 
> make sure that VM_NONUNIFORM vmas are also marked as VM_NONLINEAR.

In some places you seem to say that you (UML) only need VM_MANYPROTS vmas
linear, in other places you seem to say that your VM_MANYPROTS vmas will
be nonlinear.  I've no idea which way round it is.  Perhaps the "non"
sometimes goes missing (another reason to avoid NONUNIFORM).

I wrote that yesterday.  Thanks, you've cleared it up in private mail:
the VM_MANYPROTS vmas that UML wants are VM_NONLINEAR anyway.

Yes, I see your dilemma there: you rightly want to avoid adding bloat
by distinguishing cases that you don't need distinguished; but equally
rightly fear that someone somewhere will start using the VM_MANYPROTS
for other reasons, and hit the inefficiencies of VM_NONLINEAR
unnecessarily.  I share your uncertainty, I don't have an immediate
feel for the right direction on that.

> *) No more usage of a new syscall slot: to use the new interface, application 
> will use the new MAP_NOINHERIT flag I've added. I've still the patches to use 
> the old -mm ABI, if there's any reason they're needed.

I'm glad you've scrapped the new syscall slot, that really put me off
the old patch (though I was probably being silly about it).  This way
is much better, but again I quarrel with your naming.

"Inherit" is about parents and children, this is not; and furthermore,
some UNIXes had a MAP_INHERIT (see asm-alpha/mman.h) which was about
passing an mmap across exec.  Your MAP_NOINHERIT has nothing to do
with that.  MAP_MANYPROTS would help us to follow the trail more
easily (though it's true that you can't actually pass many prots
in to a single remap_file_pages call).

> Subject: [patch 01/18] remap_file_pages protection support: uml, i386, x64 bits
> 
> Update pte encoding macros for UML, i386 and x86-64.
> Also, add the MAP_NOINHERIT flag to arch headers.

Well, I don't find your patch division very helpful, since you introduce
these without us seeing what use is made of them.  And the MAP_NOINHERIT
additions cover a different subset of arches (ppc, ppc64, s390 in there):
those should be in some other patch.

Usually we just do the i386 arch first, and supply some other patch(es)
for all the others.  But you've good reason to start with UML too, and
it makes sense to include x86_64 along too if you're happy to do so.

But it'll probably waste your time and mine to go on discussng patch
division, let's leave it at that.

> *** remap_file_pages protection support: improvement for UML bits
> 
> Recover one bit by additionally using _PAGE_NEWPROT. Since I wasn't sure this
> would work, I've split this out, but it has worked well. We rely on the fact
> that pte_newprot always checks first if the PTE is marked present.

And we never hear of _PAGE_NEWPROT or pte_newprot again.  Ah, they're
already defined in and peculiar to UML, I see.  Well, if this some
UML improvement change, please put that in a separate UML patch.

> -#define pte_to_pgoff(pte) (pte_val(pte) >> 4)
> +#define pte_to_pgoff(pte) (((pte_val(pte) >> 6) << 1) | ((pte_val(pte) >> 2) & 0x1))
> +#define pte_to_pgprot(pte) \
> +	__pgprot((pte_val(pte) & (_PAGE_RW | _PAGE_PROTNONE)) \
> +		| ((pte_val(pte) & _PAGE_PROTNONE) ? 0 : \
> +			(_PAGE_USER | _PAGE_PRESENT)) | _PAGE_ACCESSED)

It took me a long time to get past this!  But it's not your fault
at all, you are just tweaking what's there.  In the end I decided
I'd do better to take it on trust and move along.

(I realize that PROTNONE is a specially awkward case, but I keep
feeling that we make it even more awkward than it need be.  But
again, if so, that's not your fault.  And I've never been forced
to think it through here, as you have.)

> -#define PTE_FILE_MAX_BITS	29
> +#define PTE_FILE_MAX_BITS	28

That being the i386 2-level case.  I think that one less bit there is
probably okay, since wli changed the 3-level case to use all 32 bits
of the high word.  The people needing 29 file offset bits are probably
already using PAE, and can be directed to it if not.

But there may be other 32-bit arches, without a 3-level alternative,
which are impacted.  (I don't think I'm telling you anything you
don't already know, just highlighting a point for others to comment.)

> +#define MAP_NOINHERIT	0x20000		/* don't inherit the protection bits of the underlying vma*/

As above, MAP_MANYPROTS or something better please.  And please make it
clear in the comment that it's a flag to remap_file_pages and nothing else.

> Subject: [patch 02/18] remap_file_pages protection support: handle nonuniform VMAs
> 
> * (TODO?) avoid regression in max file offset with r_f_p() for older mappings;
>   we could use either the old offset encoding or the new offset-prot encoding
>   depending on this flag.
>   It's trivial to do, just I don't know whether existing apps will overflow
>   the new limits. They go down from 2Tb to 1Tb on i386 and 512G on PPC, and
>   from 256G to 128G on S390/31 bits. Give me a call in case.

Spare me!  That would make those macros even worse.
I hope noone will demand it.

> mprotect alters the VMA prots and walks each present PTE, ignoring installed
> ones; their saved prots will be restored on faults, ignoring VMA ones and
> losing the mprotect() on them. So, we must restore VMA prots when the VMA is
> uniform, as we used to do.

If I understand it, you're here commenting on the history of these
remap_file_pages patches.  That's okay in the 0/18 introducing
a new version, but please, in the comment on the patch itself,
explain what it's doing, not how you've been changing it around.
Oh, mm/mprotect.c doesn't even come in this patch, you're writing
about the next one.

> +	pgprot = vma->vm_flags & VM_NONUNIFORM ? pte_to_pgprot(*pte): vma->vm_page_prot;

Personally I find "(vma->vm_flags & VM_NONUNIFORM)" more reassuring there.

> Subject: [patch 03/18] remap_file_pages protection support: make mprotect skip pagetables on nonuniform
> 
> There is IMHO no reason to support using mprotect on non-uniform VMAs. The
> only exception is to change the VMA's default protection (which is used for
> non-individually remapped pages), but it must still ignore the page tables, as
> done in this patch.
> 
> The only unsatisfied need is if I want to change protections without changing
> the indexes, which with remap_file_pages you must do one page at a time and
> re-specifying the indexes.
> 
> It is more reasonable to allow remap_file_pages to change protections on a PTE
> range without changing the offsets. I've not implemented this, but if wanted I
> can. For sure, UML doesn't currently need this interface.
> 
> However, for now I've implemented only this change to mprotect(), I'd like to
> get some feedback about this choice.

No, I think I disagree with your choice here.  A reasonable person,
doing a successful, prohibitive (e.g. remove write access) mprotect
on a range, would expect those prohibitions then to be enforced across
the range.  Whereas you're letting existing entries (whether present
or paged out) retain the permissions they were given earlier with
remap_file_pages.

I think mprotect should remove the VM_MANYPROTS attribute of each
vma in the range, and give all the present entries the new pgprot
(in the same way that it normally does).

Or, if that would really bloat the implementation (I don't see why,
but perhaps the encoding of absent entries, maybe manyprots, maybe
nonlinear, might do so), just fail with -EACCES when sys_mprotect
meets a manyprots vma, until someone asks for better.

As to letting remap_file_pages change permissions without changing
file offsets somehow (another MAP_ flag I suppose), yes, it could
be done but don't bother until/unless there is such need.

> [PATCH] remove implied vm_ops check

This one was missing from the sequence, and you supplied privately.
Remove implied vm_ops check??  No, that belongs to something else.

> Even now, we'll sometimes take the write lock, and maybe go to sleep with it
> for I/O.

You're writing about sys_remap_file_pages.

> So, in that case, we could downgrade it; after a tiny bit of thought, I've
> choosen doing that when we'll either do any I/O or we'll alter a lot of PTEs.
> About how much "a lot" is, I've copied the values from this code in
> mm/memory.c, but note they're about mm->page_table_lock, a spinlock, not a
> semaphore:
> 
> #ifdef CONFIG_PREEMPT
> # define ZAP_BLOCK_SIZE	(8 * PAGE_SIZE)
> #else
> /* No preempt: go for improved straight-line efficiency */
> # define ZAP_BLOCK_SIZE	(1024 * PAGE_SIZE)
> #endif
> 
> I'm not sure about the trade-offs - we used to have a down_write, now we have
> a down_read() and a possible up_read()-down_write(), and with this patch, the
> fast-path still takes only down_read, but the slow path will do down_read(),
> up_read(), down_write(), downgrade_write(). This will increase by one the
> number of atomic operation but increase concurrency wrt mmap and similar
> operations - I don't know how much contention there is on that lock.

Please drop all this, it's overengineered.  A semaphore is not a spinlock,
it's not adding to the latency of the system as a whole, just preventing
concurrent page faults on that mm (if you downgrade_write, you're still
excluding concurrent mmaps, mprotects etc., and necessarily so).

Originally every sys_remap_file_pages did down_write, and there was a
downgrade_write before the populate.  Doing down_write every time did
hurt, so it was changed only to do that when first going nonlinear
(to respect the locking on vm_flags), and we didn't bother to downgrade
in that one instance.

I don't mind you just adding a simple
		if (has_write_lock)
			downgrade_write(&mm->mmap_sem);
before the populate (and getting rid of the has_write_lock condition
further down), but anything more seems overkill to me.

> Also, drop a bust comment: we cannot clear VM_NONLINEAR simply because code
> elsewhere is going to use it. At the very least, madvise_dontneed() relies on
> that flag being set (remaining non-linear truncation read the mapping
> list), but the list is probably longer, and going to increase.

You misunderstand the comment (it could indeed be clearer), but you're
right that the bit about downgrading the lock is out-of-date.
I suggest you insert this comment instead.
		/*
		 * We would like to clear VM_NONLINEAR, in the case when
		 * sys_remap_file_pages covers the whole vma, so making
		 * it linear again.  But cannot do so until after a
		 * successful populate, and have no way to upgrade sem.
		 */

> Subject: [patch 04/18] remap_file_pages protection support: cleanup syscall checks
> 
> This patch reorganizes the code only, without differences in behaviour. It
> makes the code more readable on its own, and is needed for next patches. I've
> split this out to avoid cluttering real patches.

Okay, you're breaking up conditions nicely, but please break this one up too
>  
> +	if (!vma->vm_ops || !vma->vm_ops->populate || end <= start || start <
> +			vma->vm_start || end > vma->vm_end)
> +		goto out_unlock;

into a vm_ops part and a start/end part.

> Subject: [patch 05/18] remap_file_pages protection support: enhance syscall interface
> 
> @@ -203,7 +203,7 @@ asmlinkage long sys_remap_file_pages(uns
>  	/* Can we represent this offset inside this architecture's pte's? */
>  #if PTE_FILE_MAX_BITS < BITS_PER_LONG
>  	if (pgoff + (size >> PAGE_SHIFT) >= (1UL << PTE_FILE_MAX_BITS))
> -		return err;
> +		return -EOVERFLOW;
>  #endif

In the last patch you replaced all(?) the early returns by gotos;
now in this one you reintroduce an early return.  Better be consistent
and make this one a goto too.

> +	if (flags & MAP_NOINHERIT) {
> +		err = -EPERM;
> +		if (((prot & PROT_READ) && !(vma->vm_flags & VM_MAYREAD)))
> +			goto out_unlock;
> +		if (((prot & PROT_WRITE) && !(vma->vm_flags & VM_MAYWRITE)))
> +			goto out_unlock;
> +		if (((prot & PROT_EXEC) && !(vma->vm_flags & VM_MAYEXEC)))
> +			goto out_unlock;
> +		err = -EINVAL;
> +		pgprot = protection_map[calc_vm_prot_bits(prot) | VM_SHARED];

Please follow mprotect's way of doing this, calc_vm_prot_bits first then
		if ((newflags & ~(newflags >> 4)) & 0xf) {
Except please do NOT say 0xf!  Use VM_READ|VM_WRITE|VM_EXEC instead
(yeah, yeah, I know that only comes to 7).

> Subject: [patch 06/18] remap_file_pages protection support: support private vma for MAP_POPULATE
> 
> Fix MAP_POPULATE | MAP_PRIVATE. We don't need the VMA to be shared if we don't
> rearrange pages around. And it's trivial to do.

This seems reasonable to me.  I was afraid you were going to extend
VM_NONLINEAR to private maps, but no, you're just letting private maps
be populated linearly, that's fine.  (Really we ought then also to
allow anonymous maps be pre-populated also, but that's a different
patch, which wouldn't touch sys_remap_file_pages at all: forget it.)

> --- linux-2.6.git/mm/mmap.c~rfp-private-vma-2	2005-08-24 20:57:13.000000000 +0200
> +++ linux-2.6.git-paolo/mm/mmap.c	2005-08-24 20:57:13.000000000 +0200
> @@ -1124,6 +1124,10 @@ out:	
>  	}
>  	if (flags & MAP_POPULATE) {
>  		up_write(&mm->mmap_sem);
> +		/*
> +		 * remap_file_pages() works even if the mapping is private,
> +		 * in the linearly-mapped case:
> +		 */

But that's one of your historical comments.  The point of the patch is
that it's peculiar not to populate the private mappings, so why comment
when we're no longer doing something peculiar?  Please leave it out.

>  		sys_remap_file_pages(addr, len, 0,
>  					pgoff, flags & MAP_NONBLOCK);
>  		down_write(&mm->mmap_sem);

> Subject: [patch 07/18] remap_file_pages protection support: safety net for lazy arches
> 
> Since proper support requires that the arch at the very least handles
> VM_FAULT_SIGSEGV, as in next patch (otherwise the arch may BUG), and things
> are even more complex (see next patches), and it's triggerable only with
> VM_NONUNIFORM vma's, simply refuse creating them if the arch doesn't declare
> itself ready.
> 
> This is a very temporary hack, so I've clearly marked it as such. And, with
> current rythms, I've given about 6 months for arches to get ready. Reducing
> this time is perfectly ok for me.
> 
> +#ifndef __ARCH_SUPPORTS_VM_NONUNIFORM
> +	if (flags & MAP_NOINHERIT)
> +		goto out;
> +#endif

Any arch that doesn't have a definition of MAP_MANYPROTS in its asm/mman.h
has its build broken anyways.  While you're giving them all MAP_MANYPROTS
definitions, please just modify their fault handlers, as simply as possible.
So skip the __ARCH_SUPPORTS_VM_NONUNIFORM business.

>  
> +++ linux-2.6.git-paolo/Documentation/feature-removal-schedule.txt	2005-08-24 20:57:18.000000000 +0200
> +
> +---------------------------
> +
> +What:	__ARCH_SUPPORTS_VM_NONUNIFORM
> +When:	December 2005
> +Files:	mm/fremap.c, include/asm-*/pgtable.h
> +Why:	It's just there to allow arches to update their page fault handlers to
> +	support VM_FAULT_SIGSEGV, for remap_file_pages protection support.
> +	Since they may BUG if this support is not added, the syscall code
> +	refuses this new operation mode unless the arch declares itself as
> +	"VM_FAULT_SIGSEGV-aware" with this macro.
> +Who:	Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>

This is very thorough and well-intentioned, but skip it - we do that
for old features from which many drivers need converting, but it's
not like that with the architectures.  Just supply the simple patches.

> Subject: [patch 08/18] remap file pages protection support: use FAULT_SIGSEGV for protection checking

Ah ha, now we get to the heart of it.  (And nothing wrong with such a lead
up, I too like a number of patches to set the stage before the big one.)

Well, this is two patches.  First there's handling VM_FAULT_SIGSEGV in
various arches.  Please minimize those patches for now: VM_FAULT_SIGSEGV
is very sensible in itself, and with its help we might be able to move
more code from arches to common in future; but for now, try to keep it
to adding the 
		case VM_FAULT_SIGSEGV:
			goto bad_area;
in each arch fault handler.  Resist changing "survive" to "handle_fault".
Resist adding "access_mask" and passing it in place of "write": that may
be necessary later to handle VM_EXEC properly on a few architectures,
but I suggest (particularly given the subset of arches of interest)
that you start with just the read/write distinction, and if exec needs
more add it later (even if it's just a later patch of the same series).

And resist adding a preliminary VM_MANYPROTS check to each fault
handler.  Let's say instead that the mmap/mprotect permissions are the
maximum that the area can take (and therefore modify the check done on
prots in sys_remap_file_pages, to make sure they do not exceed "default").

Would that work out?

And then there's the core patch, to mm/memory.c.  But I'm afraid at this
point, just when it gets interesting, my time and my patience run out.

I still haven't come to a conclusion on the most interesting lines of
all, do_no_page's
		if (write_access || unlikely(vma->vm_flags & VM_NONUNIFORM))
			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
but I suspect you're wrong to be avoiding do_wp_page at all costs,
should instead be adapting do_wp_page to do the right thing with the
faults which arrive there.

And I think there's a serious flaw in handle_pte_fault, where it says
> +	/* VM_NONUNIFORM vma's have PTE's always installed with the correct
> +	 * protection. So, generate a SIGSEGV if a fault is caught there. */
> +	if (unlikely(vma->vm_flags & VM_NONUNIFORM))
> +		goto out_segv;

Consider two threads faulting on the same pte on different cpus.
One of them fixes it up, you're giving the other SIGSEGV?
I think this runs quite deep and will need a rework.

Sorry, I do not know when I can next face going over this,
it's a hard task for me: perhaps someone else can take over.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3
  2005-09-02 21:02 ` Hugh Dickins
@ 2005-09-04 19:10   ` Blaisorblade
  2005-09-07 12:00     ` Hugh Dickins
  0 siblings, 1 reply; 16+ messages in thread
From: Blaisorblade @ 2005-09-04 19:10 UTC (permalink / raw)
  To: user-mode-linux-devel
  Cc: Hugh Dickins, Andrew Morton, Ingo Molnar, Andi Kleen, LKML,
	Jeff Dike, Bodo Stroesser

On Friday 02 September 2005 23:02, Hugh Dickins wrote:
> On Fri, 26 Aug 2005, Blaisorblade wrote:
> > * The first 2 patches modify the PTE encoding macros and start preparing
> > the VM for the new situation (i.e. VMA which have variable protections,
> > which are called VM_NONUNIFORM. I dropped the early VM_MANYPROTS name).

> What a pity: please revert.  VM_NONUNIFORM sounds impressive, but might
> mean all kinds of things, maybe to do with NUMA.  VM_MANYPROTS is good,
> it says what it means.
Ok. Btw, before I forget: I assume I should redo the patches rather than fix 
what you say on top of mine, (at least when not changing behaviour), right?
> > * Patch 11 is a big simplification. Since we must encode the PTE's on
> > swapout like in VM_NONLINEAR vmas, the simplest way to reuse the existing
> > code is to make sure that VM_NONUNIFORM vmas are also marked as
> > VM_NONLINEAR.

> In some places you seem to say that you (UML) only need VM_MANYPROTS vmas
> linear, in other places you seem to say that your VM_MANYPROTS vmas will
> be nonlinear.  I've no idea which way round it is.  Perhaps the "non"
> sometimes goes missing (another reason to avoid NONUNIFORM).

> I wrote that yesterday.  Thanks, you've cleared it up in private mail:
> the VM_MANYPROTS vmas that UML wants are VM_NONLINEAR anyway.

> Yes, I see your dilemma there: you rightly want to avoid adding bloat
> by distinguishing cases that you don't need distinguished; but equally
> rightly fear that someone somewhere will start using the VM_MANYPROTS
> for other reasons, and hit the inefficiencies of VM_NONLINEAR
> unnecessarily.  I share your uncertainty, I don't have an immediate
> feel for the right direction on that.
We'll see later if we can cater to this case without messing up zap_pte_range 
as I did in last patch (that is the only one with which I was able to break 
something - not in the version I sent, however).
> > *) No more usage of a new syscall slot: to use the new interface,
> > application will use the new MAP_NOINHERIT flag I've added. I've still
> > the patches to use the old -mm ABI, if there's any reason they're needed.

> I'm glad you've scrapped the new syscall slot, that really put me off
> the old patch (though I was probably being silly about it).  This way
> is much better, but again I quarrel with your naming.

> "Inherit" is about parents and children, this is not; and furthermore,
> some UNIXes had a MAP_INHERIT (see asm-alpha/mman.h) which was about
> passing an mmap across exec.  Your MAP_NOINHERIT has nothing to do
> with that.  MAP_MANYPROTS would help us to follow the trail more
> easily (though it's true that you can't actually pass many prots
> in to a single remap_file_pages call).
MAP_CHGPROT? MAP_CHANGEPROT? MAP_REPROT?
VM_MANYPROTS is internal name, so there's no reason to have the same name 
either.
> > Subject: [patch 01/18] remap_file_pages protection support: uml, i386,
> > x64 bits
> >
> > Update pte encoding macros for UML, i386 and x86-64.
> > Also, add the MAP_NOINHERIT flag to arch headers.
>
> Well, I don't find your patch division very helpful, since you introduce
> these without us seeing what use is made of them.  And the MAP_NOINHERIT
> additions cover a different subset of arches (ppc, ppc64, s390 in there):
> those should be in some other patch.
For this patch, I joined up everything because people get scared when they see 
39 patches (and I've not really removed code, apart for things which were 
introduced and later rewritten, just changed the presentation between the two 
sends).
> Usually we just do the i386 arch first, and supply some other patch(es)
> for all the others.  But you've good reason to start with UML too, and
> it makes sense to include x86_64 along too if you're happy to do so.

> But it'll probably waste your time and mine to go on discussng patch
> division, let's leave it at that.

> > *** remap_file_pages protection support: improvement for UML bits

> > Recover one bit by additionally using _PAGE_NEWPROT. Since I wasn't sure
> > this would work, I've split this out, but it has worked well. We rely on
> > the fact that pte_newprot always checks first if the PTE is marked
> > present.

> And we never hear of _PAGE_NEWPROT or pte_newprot again.  Ah, they're
> already defined in and peculiar to UML, I see.  Well, if this some
> UML improvement change, please put that in a separate UML patch.
As above, I joined altogether more patches to reduce noise. And after proper 
unit testing and checking, it was safe anyway to join it.

> > -#define pte_to_pgoff(pte) (pte_val(pte) >> 4)
> > +#define pte_to_pgoff(pte) (((pte_val(pte) >> 6) << 1) | ((pte_val(pte)
> > >> 2) & 0x1)) +#define pte_to_pgprot(pte) \
> > +	__pgprot((pte_val(pte) & (_PAGE_RW | _PAGE_PROTNONE)) \
> > +		| ((pte_val(pte) & _PAGE_PROTNONE) ? 0 : \
> > +			(_PAGE_USER | _PAGE_PRESENT)) | _PAGE_ACCESSED)

> It took me a long time to get past this!  But it's not your fault
> at all, you are just tweaking what's there.  In the end I decided
> I'd do better to take it on trust and move along.

> (I realize that PROTNONE is a specially awkward case, but I keep
> feeling that we make it even more awkward than it need be.  But
> again, if so, that's not your fault.  And I've never been forced
> to think it through here, as you have.)
Actually I first just included this piece from Ingo, while changing shifts - 
one week later I was astonished about where we put _PAGE_USER, until 
realizing it's not read from the PTE but synthetized above.

I should put a comment there, anyway.
> > -#define PTE_FILE_MAX_BITS	29
> > +#define PTE_FILE_MAX_BITS	28
>
> That being the i386 2-level case.  I think that one less bit there is
> probably okay, since wli changed the 3-level case to use all 32 bits
> of the high word.  The people needing 29 file offset bits are probably
> already using PAE, and can be directed to it if not.

> But there may be other 32-bit arches, without a 3-level alternative,
> which are impacted.  (I don't think I'm telling you anything you
> don't already know, just highlighting a point for others to comment.)
s390/31 bit would suffer even more (they go 26->25). But on this, it can be 
avoided (for who doesn't use the new API) as explained below.

> > +#define MAP_NOINHERIT	0x20000		/* don't inherit the protection bits of
> > the underlying vma*/

> As above, MAP_MANYPROTS or something better please.  And please make it
> clear in the comment that it's a flag to remap_file_pages and nothing else.
Ok.
> > Subject: [patch 02/18] remap_file_pages protection support: handle
> > nonuniform VMAs

> > * (TODO?) avoid regression in max file offset with r_f_p() for older
> > mappings; we could use either the old offset encoding or the new
> > offset-prot encoding depending on this flag.
> >   It's trivial to do, just I don't know whether existing apps will
> > overflow the new limits. They go down from 2Tb to 1Tb on i386 and 512G on
> > PPC, and from 256G to 128G on S390/31 bits. Give me a call in case.

> Spare me!  That would make those macros even worse.
> I hope noone will demand it.
It's just what you remarked above. But we'd have separate macros and code 
paths (not hugely separate): use the old macro version if VM_MANYPROTS clear, 
use the new one if VM_MANYPROTS set.
> > mprotect alters the VMA prots and walks each present PTE, ignoring
> > installed ones; their saved prots will be restored on faults, ignoring
> > VMA ones and losing the mprotect() on them. So, we must restore VMA prots
> > when the VMA is uniform, as we used to do.

> If I understand it, you're here commenting on the history of these
> remap_file_pages patches.  That's okay in the 0/18 introducing
> a new version, but please, in the comment on the patch itself,
> explain what it's doing, not how you've been changing it around.
> Oh, mm/mprotect.c doesn't even come in this patch, you're writing
> about the next one.
"We used to do" refers to before the patches themselves, i.e. with current 
r_f_p() code.
> > +	pgprot = vma->vm_flags & VM_NONUNIFORM ? pte_to_pgprot(*pte):
> > vma->vm_page_prot;
>
> Personally I find "(vma->vm_flags & VM_NONUNIFORM)" more reassuring there.
Ok.
> > Subject: [patch 03/18] remap_file_pages protection support: make mprotect
> > skip pagetables on nonuniform

> > There is IMHO no reason to support using mprotect on non-uniform VMAs.
> > The only exception is to change the VMA's default protection (which is
> > used for non-individually remapped pages), but it must still ignore the
> > page tables, as done in this patch.

> > The only unsatisfied need is if I want to change protections without
> > changing the indexes, which with remap_file_pages you must do one page at
> > a time and re-specifying the indexes.

> > It is more reasonable to allow remap_file_pages to change protections on
> > a PTE range without changing the offsets. I've not implemented this, but
> > if wanted I can. For sure, UML doesn't currently need this interface.

> > However, for now I've implemented only this change to mprotect(), I'd
> > like to get some feedback about this choice.

> No, I think I disagree with your choice here.  A reasonable person,
> doing a successful, prohibitive (e.g. remove write access) mprotect
> on a range, would expect those prohibitions then to be enforced across
> the range.  Whereas you're letting existing entries (whether present
> or paged out) retain the permissions they were given earlier with
> remap_file_pages.
A reasonable person using VM_MANYPROTS must study the new API anyway - and if 
he went even to the trouble of writing twice the code to support even older 
kernels (UML does), and he wants to prohibit access to a range, he'd rather 
go with remap_file_pages (PROT_NONE), which does exactly what you expect, or 
it would waste the advantage of avoiding VMA splitting.

My point is that after I set some PTEs individually with RFP, while some 
others are installed by do_no_page() with the default VMA prot, how can the 
kernel distinguish between them? Because (say) I don't want to change the 
ones I installed separately. Should the kernel cater to this need?

However, changing the default VMA prot is very reasonable. And with the way I 
chose the user is given more flexibility.

Random example: a profiler/debugger lets the app run for a while and fault in 
some pages, on a nonuniform vma, and/or maps some ones by hand, and then 
wants to protect the rest without touching present ones - it'd use 
mprotect(), but if that's done your way it becomes impossible.

Don't know if this will ever be useful, but that's why the API is more 
powerful.

However without that new "remap range changing permissions without changing 
offset" we have a problem on the other side.

So, since we have no use for it currently, I'll choose -EINVAL.
> I think mprotect should remove the VM_MANYPROTS attribute of each
> vma in the range, and give all the present entries the new pgprot
> (in the same way that it normally does).

> Or, if that would really bloat the implementation (I don't see why,
> but perhaps the encoding of absent entries, maybe manyprots, maybe
> nonlinear, might do so), just fail with -EACCES when sys_mprotect
> meets a manyprots vma, until someone asks for better.
That is reasonable; -EINVAL makes more sense for me in this case (why 
permission denied?)
> As to letting remap_file_pages change permissions without changing
> file offsets somehow (another MAP_ flag I suppose), yes, it could
> be done but don't bother until/unless there is such need.
Agreed.
> > [PATCH] remove implied vm_ops check

> This one was missing from the sequence, and you supplied privately.
> Remove implied vm_ops check??  No, that belongs to something else.
Ahr - sorry, that's an old bug of my SCM scripts.
> > Even now, we'll sometimes take the write lock, and maybe go to sleep with
> > it for I/O.

> You're writing about sys_remap_file_pages.
Yes, the real title is more meaningful.
> > So, in that case, we could downgrade it; after a tiny bit of thought,
> > I've choosen doing that when we'll either do any I/O or we'll alter a lot
> > of PTEs. About how much "a lot" is, I've copied the values from this code
> > in mm/memory.c, but note they're about mm->page_table_lock, a spinlock,
> > not a semaphore:
> >
> > #ifdef CONFIG_PREEMPT
> > # define ZAP_BLOCK_SIZE	(8 * PAGE_SIZE)
> > #else
> > /* No preempt: go for improved straight-line efficiency */
> > # define ZAP_BLOCK_SIZE	(1024 * PAGE_SIZE)
> > #endif

> > I'm not sure about the trade-offs - we used to have a down_write, now we
> > have a down_read() and a possible up_read()-down_write(), and with this
> > patch, the fast-path still takes only down_read, but the slow path will
> > do down_read(), up_read(), down_write(), downgrade_write(). This will
> > increase by one the number of atomic operation but increase concurrency
> > wrt mmap and similar operations - I don't know how much contention there
> > is on that lock.

> Please drop all this, it's overengineered.  A semaphore is not a spinlock,
> it's not adding to the latency of the system as a whole, just preventing
> concurrent page faults on that mm
Don't know if this is casual - but have you anything personal against 
Christoph Lameter ;-)?
> (if you downgrade_write, you're still 
> excluding concurrent mmaps, mprotects etc., and necessarily so).
Yes, correct, I missed that.
> Originally every sys_remap_file_pages did down_write, and there was a
> downgrade_write before the populate.  Doing down_write every time did
> hurt, so it was changed only to do that when first going nonlinear
> (to respect the locking on vm_flags), and we didn't bother to downgrade
> in that one instance.

> I don't mind you just adding a simple
> 		if (has_write_lock)
> 			downgrade_write(&mm->mmap_sem);
> before the populate (and getting rid of the has_write_lock condition
> further down), but anything more seems overkill to me.
Ok, will see with Ingo. The fact that it's done only on first time and it's 
not a spinlock agrees with you, anyway.
> > Also, drop a bust comment: we cannot clear VM_NONLINEAR simply because
> > code elsewhere is going to use it. At the very least, madvise_dontneed()
> > relies on that flag being set (remaining non-linear truncation read the
> > mapping list), but the list is probably longer, and going to increase.

> You misunderstand the comment (it could indeed be clearer), but you're
> right that the bit about downgrading the lock is out-of-date.
> I suggest you insert this comment instead.
> 		/*
> 		 * We would like to clear VM_NONLINEAR, in the case when
> 		 * sys_remap_file_pages covers the whole vma, so making
> 		 * it linear again.  But cannot do so until after a
> 		 * successful populate, and have no way to upgrade sem.
> 		 */
Aaaaaaaaah, ok.
> > Subject: [patch 04/18] remap_file_pages protection support: cleanup
> > syscall checks

> > This patch reorganizes the code only, without differences in behaviour.
> > It makes the code more readable on its own, and is needed for next
> > patches. I've split this out to avoid cluttering real patches.

> Okay, you're breaking up conditions nicely, but please break this one up
> too

> > +	if (!vma->vm_ops || !vma->vm_ops->populate || end <= start || start <
> > +			vma->vm_start || end > vma->vm_end)
> > +		goto out_unlock;
>
> into a vm_ops part and a start/end part.
Ok.
> > Subject: [patch 05/18] remap_file_pages protection support: enhance
> > syscall interface

> > @@ -203,7 +203,7 @@ asmlinkage long sys_remap_file_pages(uns
> >  	/* Can we represent this offset inside this architecture's pte's? */
> >  #if PTE_FILE_MAX_BITS < BITS_PER_LONG
> >  	if (pgoff + (size >> PAGE_SHIFT) >= (1UL << PTE_FILE_MAX_BITS))
> > -		return err;
> > +		return -EOVERFLOW;
> >  #endif

> In the last patch you replaced all(?) the early returns by gotos;
> now in this one you reintroduce an early return.  Better be consistent
> and make this one a goto too.
Ok.
> > +	if (flags & MAP_NOINHERIT) {
> > +		err = -EPERM;
> > +		if (((prot & PROT_READ) && !(vma->vm_flags & VM_MAYREAD)))
> > +			goto out_unlock;
> > +		if (((prot & PROT_WRITE) && !(vma->vm_flags & VM_MAYWRITE)))
> > +			goto out_unlock;
> > +		if (((prot & PROT_EXEC) && !(vma->vm_flags & VM_MAYEXEC)))
> > +			goto out_unlock;
> > +		err = -EINVAL;
> > +		pgprot = protection_map[calc_vm_prot_bits(prot) | VM_SHARED];

> Please follow mprotect's way of doing this, calc_vm_prot_bits first then
> 		if ((newflags & ~(newflags >> 4)) & 0xf) {
> Except please do NOT say 0xf!  Use VM_READ|VM_WRITE|VM_EXEC instead
> (yeah, yeah, I know that only comes to 7).
Ok - nice trick hardcoding the shift between MAY and base flags.... when 
constants are hardcoded in assembler, nobody forgets comments.
> > Subject: [patch 06/18] remap_file_pages protection support: support
> > private vma for MAP_POPULATE

> > Fix MAP_POPULATE | MAP_PRIVATE. We don't need the VMA to be shared if we
> > don't rearrange pages around. And it's trivial to do.

> This seems reasonable to me.  I was afraid you were going to extend
> VM_NONLINEAR to private maps, but no, you're just letting private maps
> be populated linearly, that's fine.
Would that be a real problem, when limited to readonly mappings?
There's some interest in that, for library loading - a binary with 100 dso's 
has 300 vmas...

I see the problem with anonymous memory is bigger (not entirely different from 
the current situation).
> (Really we ought then also to 
> allow anonymous maps be pre-populated also, but that's a different
> patch, which wouldn't touch sys_remap_file_pages at all: forget it.)

> > --- linux-2.6.git/mm/mmap.c~rfp-private-vma-2	2005-08-24
> > 20:57:13.000000000 +0200 +++ linux-2.6.git-paolo/mm/mmap.c	2005-08-24
> > 20:57:13.000000000 +0200 @@ -1124,6 +1124,10 @@ out:
> >  	}
> >  	if (flags & MAP_POPULATE) {
> >  		up_write(&mm->mmap_sem);
> > +		/*
> > +		 * remap_file_pages() works even if the mapping is private,
> > +		 * in the linearly-mapped case:
> > +		 */

> But that's one of your historical comments.  The point of the patch is
> that it's peculiar not to populate the private mappings, so why comment
> when we're no longer doing something peculiar?  Please leave it out.
Ok.
> >  		sys_remap_file_pages(addr, len, 0,
> >  					pgoff, flags & MAP_NONBLOCK);
> >  		down_write(&mm->mmap_sem);

> > Subject: [patch 07/18] remap_file_pages protection support: safety net
> > for lazy arches

> > Since proper support requires that the arch at the very least handles
> > VM_FAULT_SIGSEGV, as in next patch (otherwise the arch may BUG), and
> > things are even more complex (see next patches), and it's triggerable
> > only with VM_NONUNIFORM vma's, simply refuse creating them if the arch
> > doesn't declare itself ready.

> > This is a very temporary hack, so I've clearly marked it as such. And,
> > with current rythms, I've given about 6 months for arches to get ready.
> > Reducing this time is perfectly ok for me.

> > +#ifndef __ARCH_SUPPORTS_VM_NONUNIFORM
> > +	if (flags & MAP_NOINHERIT)
> > +		goto out;
> > +#endif

> Any arch that doesn't have a definition of MAP_MANYPROTS in its asm/mman.h
> has its build broken anyways.  While you're giving them all MAP_MANYPROTS
> definitions, please just modify their fault handlers, as simply as
> possible. So skip the __ARCH_SUPPORTS_VM_NONUNIFORM business.
Argh, ok.
> > +++
> > linux-2.6.git-paolo/Documentation/feature-removal-schedule.txt	2005-08-24
> > 20:57:18.000000000 +0200 +
[...]
> This is very thorough and well-intentioned, but skip it - we do that
> for old features from which many drivers need converting, but it's
> not like that with the architectures.  Just supply the simple patches.

> > Subject: [patch 08/18] remap file pages protection support: use
> > FAULT_SIGSEGV for protection checking

> Ah ha, now we get to the heart of it.  (And nothing wrong with such a lead
> up, I too like a number of patches to set the stage before the big one.)

> Well, this is two patches.  First there's handling VM_FAULT_SIGSEGV in
> various arches.  Please minimize those patches for now: VM_FAULT_SIGSEGV
> is very sensible in itself, and with its help we might be able to move
> more code from arches to common in future; but for now, try to keep it
> to adding the
> 		case VM_FAULT_SIGSEGV:
> 			goto bad_area;
> in each arch fault handler.  Resist changing "survive" to "handle_fault".
> Resist adding "access_mask" and passing it in place of "write": that may
> be necessary later to handle VM_EXEC properly on a few architectures,
> but I suggest (particularly given the subset of arches of interest)
> that you start with just the read/write distinction, and if exec needs
> more add it later (even if it's just a later patch of the same series).

> And resist adding a preliminary VM_MANYPROTS check to each fault
> handler.  Let's say instead that the mmap/mprotect permissions are the
> maximum that the area can take (and therefore modify the check done on
> prots in sys_remap_file_pages, to make sure they do not exceed "default").

> Would that work out?
No, absolutely not, I'm sorry. Since I may have sparse mappings (I'll use this 
to emulate TLBs), we have a huge PROT_NONE mapping and then remap individual 
pages, after their allocation, at fault time, with more permissive settings.

That check may be moved later, to beginning of bad_area, if you say "default 
prots are the minimum one, I can only remap with more permissive settings". 
That would avoid affecting the fast path - even if the branch is clearly an 
"unlikely" one, so shouldn't give mispredictions at least.

If your problem is just avoiding changing all arch handlers, that amounts to 
letting the new API misbehave on them (a bit of an hack, but with 
 		case VM_FAULT_SIGSEGV:
 			goto bad_area;
everywhere it's still safe) or using my arch_supports trick.
> And then there's the core patch, to mm/memory.c.  But I'm afraid at this
> point, just when it gets interesting, my time and my patience run out.

> I still haven't come to a conclusion on the most interesting lines of
> all, do_no_page's
> 		if (write_access || unlikely(vma->vm_flags & VM_NONUNIFORM))
> 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> but I suspect you're wrong to be avoiding do_wp_page at all costs,
> should instead be adapting do_wp_page to do the right thing with the
> faults which arrive there.
Actually, today (in your discussion with "Heff" Dike) I realized the above 
change is redundant. pte_mkdirty is only needed if VM_WRITE (and not even 
there), and in that case, since we are doing shared mappings, the write 
access is already set in vma->vm_page_prot.

> And I think there's a serious flaw in handle_pte_fault, where it says

> > +	/* VM_NONUNIFORM vma's have PTE's always installed with the correct
> > +	 * protection. So, generate a SIGSEGV if a fault is caught there. */
> > +	if (unlikely(vma->vm_flags & VM_NONUNIFORM))
> > +		goto out_segv;

> Consider two threads faulting on the same pte on different cpus.
> One of them fixes it up, you're giving the other SIGSEGV?
> I think this runs quite deep and will need a rework.
Not so deep.
Weeeell, I was ready to ripping this out (even if for other reasons), I assume 
that comparing write_access/access_mask and the protections in the PTE is 
perfectly ok and fixes this.

Luckily, even here we have no regression wrt other apps.
> Sorry, I do not know when I can next face going over this,
> it's a hard task for me: perhaps someone else can take over.
For me is ok - just tell Andrew who should be appointed at this. There's not 
an "official" list of VM maintainers anywhere, even if I'm under the 
impression you're currently the top one.
> Hugh
-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade


		
___________________________________ 
Yahoo! Messenger: chiamate gratuite in tutto il mondo 
http://it.beta.messenger.yahoo.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3
  2005-09-04 19:10   ` [uml-devel] " Blaisorblade
@ 2005-09-07 12:00     ` Hugh Dickins
  2005-09-13 18:25       ` Blaisorblade
                         ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Hugh Dickins @ 2005-09-07 12:00 UTC (permalink / raw)
  To: Blaisorblade
  Cc: user-mode-linux-devel, Andrew Morton, Ingo Molnar, Andi Kleen,
	LKML, Jeff Dike, Bodo Stroesser

On Sun, 4 Sep 2005, Blaisorblade wrote:
> On Friday 02 September 2005 23:02, Hugh Dickins wrote:
> > On Fri, 26 Aug 2005, Blaisorblade wrote:
> > > * The first 2 patches modify the PTE encoding macros and start preparing
> > > the VM for the new situation (i.e. VMA which have variable protections,
> > > which are called VM_NONUNIFORM. I dropped the early VM_MANYPROTS name).
> 
> > What a pity: please revert.  VM_NONUNIFORM sounds impressive, but might
> > mean all kinds of things, maybe to do with NUMA.  VM_MANYPROTS is good,
> > it says what it means.
> Ok. Btw, before I forget: I assume I should redo the patches rather than fix 
> what you say on top of mine, (at least when not changing behaviour), right?

If it's something pervasive, like such naming, then yes you should redo
the patches.  Minor local corrections can be appended as an additional
patch, so long as they don't make the original patch ridiculous.

I don't understand your "(at least when not changing behaviour)":
if you mean that significant changes of behaviour should be done as extra
patches at the end, no: surely they should be patches of the main sequence.

All of this supposes that my opinion is to be followed.
I've given my opinions, Andi gave his, I don't remember others.
It doesn't mean any one of us is right.  Did you agree with mine?

> > "Inherit" is about parents and children, this is not; and furthermore,
> > some UNIXes had a MAP_INHERIT (see asm-alpha/mman.h) which was about
> > passing an mmap across exec.  Your MAP_NOINHERIT has nothing to do
> > with that.  MAP_MANYPROTS would help us to follow the trail more
> > easily (though it's true that you can't actually pass many prots
> > in to a single remap_file_pages call).
> MAP_CHGPROT? MAP_CHANGEPROT? MAP_REPROT?
> VM_MANYPROTS is internal name, so there's no reason to have the same name 
> either.

It doesn't have to be the same name, true, but I find it a lot easier
to follow the trail of functionality when similar naming is used.
Perhaps the "PROT" part is enough: MAP_SETPROT or MAP_CHGPROT or
MAP_CHANGEPROT if you don't like MAP_MANYPROTS.

> It's just what you remarked above. But we'd have separate macros and code 
> paths (not hugely separate): use the old macro version if VM_MANYPROTS clear, 
> use the new one if VM_MANYPROTS set.

I think those macros are grotesque enough already.
But I don't have a constructive suggestion.

> > No, I think I disagree with your choice here.  A reasonable person,
> > doing a successful, prohibitive (e.g. remove write access) mprotect
> > on a range, would expect those prohibitions then to be enforced across
> > the range.  Whereas you're letting existing entries (whether present
> > or paged out) retain the permissions they were given earlier with
> > remap_file_pages.
> A reasonable person using VM_MANYPROTS must study the new API anyway - and if 
> he went even to the trouble of writing twice the code to support even older 
> kernels (UML does), and he wants to prohibit access to a range, he'd rather 
> go with remap_file_pages (PROT_NONE), which does exactly what you expect, or 
> it would waste the advantage of avoiding VMA splitting.

We disagree quite strongly on what's reasonable for mprotect to do,
if it's to do anything interesting at all: we could indeed define it
to do special new things for this case (sorry, I've cut your suggestions),
but I'd much prefer not.

However, it looks like we can agree that for now it needn't do anything
on these areas but give an error.

> > Or, if that would really bloat the implementation (I don't see why,
> > but perhaps the encoding of absent entries, maybe manyprots, maybe
> > nonlinear, might do so), just fail with -EACCES when sys_mprotect
> > meets a manyprots vma, until someone asks for better.
> That is reasonable; -EINVAL makes more sense for me in this case (why 
> permission denied?)

But can we agree on which error?  I avoided -EINVAL, partly because it's
a very overloaded error code (perhaps the most overloaded of all?), so
often unhelpful to the user; and partly because for mprotect it appears
to relate to invalid PROT flags rather than anything else.  Whereas if
mprotect is applied to a HUGETLB area which it can't do anything with,
-EACCES is currently returned.  So I followed that precedent.

> > > Subject: [patch 06/18] remap_file_pages protection support: support
> > > private vma for MAP_POPULATE
> 
> > > Fix MAP_POPULATE | MAP_PRIVATE. We don't need the VMA to be shared if we
> > > don't rearrange pages around. And it's trivial to do.
> 
> > This seems reasonable to me.  I was afraid you were going to extend
> > VM_NONLINEAR to private maps, but no, you're just letting private maps
> > be populated linearly, that's fine.
> Would that be a real problem, when limited to readonly mappings?
> There's some interest in that, for library loading - a binary with 100 dso's 
> has 300 vmas...

You seem to see vmas as nothing but obstacles.  Perhaps there is a reason
we divide the mm into vmas?  But perhaps there is a better way of doing it.

Regarding nonlinear readonly.  I never asked Ingo why he excluded it -
suspect he didn't intend to, but missed the peculiar treatment of VM_SHARED
versus VM_MAYSHARE - my apologies, Ingo, if I'm underestimating you!  But
I was glad he had because it demands write access to the file being mapped
nonlinear.  Therefore the ordinary user cannot map libc.so nonlinear, and
condemn all users to the sledgehammer fashion of try_to_unmap_cluster.

Though thinking through that again now, the user of the nonlinear vma
is penalized, and the whole system is penalized by the difficulty in
reclaiming efficiently, but I don't see the other users of the library
particularly penalized (they might be unfairly advantaged by having its
pages stay unnaturally long in memory).  Either I was wrong before, or
I'm missing another aspect of it now: I don't know which.

> > And resist adding a preliminary VM_MANYPROTS check to each fault
> > handler.  Let's say instead that the mmap/mprotect permissions are the
> > maximum that the area can take (and therefore modify the check done on
> > prots in sys_remap_file_pages, to make sure they do not exceed "default").
> 
> > Would that work out?
> No, absolutely not, I'm sorry. Since I may have sparse mappings (I'll use this 
> to emulate TLBs), we have a huge PROT_NONE mapping and then remap individual 
> pages, after their allocation, at fault time, with more permissive settings.

I think I get the picture now: you start with a dark sky, then insert the
stars one by one.  Yes, I accept what I proposed would be of no use to you.

Okay, so as well as the VM_FAULT_SIGSEGV case, you do need to add that
preliminary check to each arch.

So far as I can see (I may have missed it), you really don't need to
change from the write boolean (perhaps -1 for exec in one arch??) to the
write_access mask: it's not a bad change, but the less you have to modify
each of the architectures, the better for now.

> That check may be moved later, to beginning of bad_area, if you say "default 
> prots are the minimum one, I can only remap with more permissive settings". 
> That would avoid affecting the fast path - even if the branch is clearly an 
> "unlikely" one, so shouldn't give mispredictions at least.

Would it?  Anyway, although it fits with UML's particular usage (starting
from PROT_NONE), your minimum permissions idea seems neither natural nor
useful to me (and I don't think you're convinced by it either, just trying
to offer a constructive alternative).

No, just stick with the "unlikely" test in the obvious place where you
already put it, and leave it for later optimization to rearrange that if
necessary: perhaps Christoph L will want modify the ia64 path, to move
that test down into the bad_area handling, to jump back if manyprots;
but no need for spaghetti in the initial implementation.

> > And I think there's a serious flaw in handle_pte_fault, where it says
> 
> > > +	/* VM_NONUNIFORM vma's have PTE's always installed with the correct
> > > +	 * protection. So, generate a SIGSEGV if a fault is caught there. */
> > > +	if (unlikely(vma->vm_flags & VM_NONUNIFORM))
> > > +		goto out_segv;
> 
> > Consider two threads faulting on the same pte on different cpus.
> > One of them fixes it up, you're giving the other SIGSEGV?
> > I think this runs quite deep and will need a rework.
> Not so deep.
> Weeeell, I was ready to ripping this out (even if for other reasons), I assume 
> that comparing write_access/access_mask and the protections in the PTE is 
> perfectly ok and fixes this.

Probably, yes.  Let's see what you come up with next and think it over.

> > Sorry, I do not know when I can next face going over this,
> > it's a hard task for me: perhaps someone else can take over.
> For me is ok - just tell Andrew who should be appointed at this. There's not 
> an "official" list of VM maintainers anywhere, even if I'm under the 
> impression you're currently the top one.

Mercifully for you, me and Linux, no.  Andrew was mm maintainer when he
started the -mm tree, and remains so.  But obviously he has a very heavy
load, and may look to others for opinions.  He didn't ask me on this,
but remap_file_pages does fall into one of my areas of concern, so I
felt obliged to give my opinions.  I guess that I shall have to look
at a later version, but I badly need to focus on the page_table_lock
again before spending time elsewhere: I get very sick of holding up
other people's requirements, but yours is not the first in line.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3
  2005-09-07 12:00     ` Hugh Dickins
@ 2005-09-13 18:25       ` Blaisorblade
  2005-09-20 15:06       ` Remap_file_pages, RSS limits, security implications (was: Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3) Blaisorblade
  2005-09-26 15:58       ` [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3 Blaisorblade
  2 siblings, 0 replies; 16+ messages in thread
From: Blaisorblade @ 2005-09-13 18:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: user-mode-linux-devel, Andrew Morton, Ingo Molnar, Andi Kleen,
	LKML, Jeff Dike, Bodo Stroesser

On Wednesday 07 September 2005 14:00, Hugh Dickins wrote:
> On Sun, 4 Sep 2005, Blaisorblade wrote:
> > On Friday 02 September 2005 23:02, Hugh Dickins wrote:
> > > On Fri, 26 Aug 2005, Blaisorblade wrote:
> > > > * The first 2 patches modify the PTE encoding macros and start
> > > > preparing the VM for the new situation (i.e. VMA which have variable
> > > > protections, which are called VM_NONUNIFORM. I dropped the early
> > > > VM_MANYPROTS name).

> If it's something pervasive, like such naming, then yes you should redo
> the patches.  Minor local corrections can be appended as an additional
> patch, so long as they don't make the original patch ridiculous.
>
> I don't understand your "(at least when not changing behaviour)":
> if you mean that significant changes of behaviour should be done as extra
> patches at the end, no: surely they should be patches of the main sequence.
Perfectly fine.
> All of this supposes that my opinion is to be followed.
> I've given my opinions, Andi gave his, I don't remember others.
> It doesn't mean any one of us is right.  Did you agree with mine?
For me it's no problem renaming, I see your point, have no particular reason 
to insist with mine.

> > > "Inherit" is about parents and children, this is not;
[...]

> > MAP_CHGPROT? MAP_CHANGEPROT? MAP_REPROT?
> > VM_MANYPROTS is internal name, so there's no reason to have the same name
> > either.

> It doesn't have to be the same name, true, but I find it a lot easier
> to follow the trail of functionality when similar naming is used.
> Perhaps the "PROT" part is enough: MAP_SETPROT or MAP_CHGPROT or
> MAP_CHANGEPROT if you don't like MAP_MANYPROTS.
MAP_CHGPROT is perfectly ok...
> > It's just what you remarked above. But we'd have separate macros and code
> > paths (not hugely separate): use the old macro version if VM_MANYPROTS
> > clear, use the new one if VM_MANYPROTS set.

> I think those macros are grotesque enough already.
That change wouldn't be in the macro themselves. We'd have a slightly 
different path in the caller to handle that.
> But I don't have a constructive suggestion.
-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade


	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Remap_file_pages, RSS limits, security implications (was: Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3)
  2005-09-07 12:00     ` Hugh Dickins
  2005-09-13 18:25       ` Blaisorblade
@ 2005-09-20 15:06       ` Blaisorblade
  2005-09-20 18:23         ` Rik van Riel
  2005-09-21 15:16         ` Hugh Dickins
  2005-09-26 15:58       ` [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3 Blaisorblade
  2 siblings, 2 replies; 16+ messages in thread
From: Blaisorblade @ 2005-09-20 15:06 UTC (permalink / raw)
  To: Hugh Dickins, Rik van Riel; +Cc: Andrew Morton, LKML, Ingo Molnar

On Wednesday 07 September 2005 14:00, Hugh Dickins wrote:
> On Sun, 4 Sep 2005, Blaisorblade wrote:
> > On Friday 02 September 2005 23:02, Hugh Dickins wrote:
> > > On Fri, 26 Aug 2005, Blaisorblade wrote:
> > > > Subject: [patch 06/18] remap_file_pages protection support: support
> > > > private vma for MAP_POPULATE

> > > [...]
> > > you're just letting private maps
> > > be populated linearly, that's fine.

> > Would that be a real problem, when limited to readonly mappings?

> Regarding nonlinear readonly.  I never asked Ingo why he excluded it -
> suspect he didn't intend to, but missed the peculiar treatment of VM_SHARED
> versus VM_MAYSHARE - my apologies, Ingo, if I'm underestimating you!
Ahh, ok... VM_MAYSHARE is the recorded MAP_SHARED, while VM_SHARED says 
whether the pages are actually shared and writable.
> But 
> I was glad he had because it demands write access to the file being mapped
> nonlinear.  Therefore the ordinary user cannot map libc.so nonlinear, and
> condemn all users to the sledgehammer fashion of try_to_unmap_cluster.

> Though thinking through that again now, the user of the nonlinear vma
> is penalized,

Where? Not in the page fault path.... it's as penalized as the rest of the 
system. Or will direct reclaim have a preference for pages of the calling 
process?
> and the whole system is penalized by the difficulty in 
> reclaiming efficiently, but I don't see the other users of the library
> particularly penalized (they might be unfairly advantaged by having its
> pages stay unnaturally long in memory).
Those pages would be either needed (and wouldn't be swapped anyway) or 
unneeded (and thus they'd waste memory).

But the waste is possible even currently. If not having rmap were a local DoS, 
well, an unprivileged user may well mmap and remap nonlinearly some really 
big files.

So, it would really be better to actually enforce the RSS rlimit when mapping 
in pages in *nonlinear* areas (and fallback on setting file PTE's like on 
NONBLOCK & page not in cache), rather than the "current" Rik's idea of 
marking pages as inactive on memory-hog processes.

But oh, right in mm/trash.c, the code which should do part of this is fully 
commented out - and it was in the very first version of the code (looking 
through bkcvs-git repository).

And the RLIMIT_RSS is totally unused - I bet Rik's patch didn't manage to go 
in, or it's me missing something?
> Either I was wrong before, or 
> I'm missing another aspect of it now: I don't know which.

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade


	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Remap_file_pages, RSS limits, security implications (was: Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3)
  2005-09-20 15:06       ` Remap_file_pages, RSS limits, security implications (was: Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3) Blaisorblade
@ 2005-09-20 18:23         ` Rik van Riel
  2005-09-21 15:16         ` Hugh Dickins
  1 sibling, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2005-09-20 18:23 UTC (permalink / raw)
  To: Blaisorblade; +Cc: Hugh Dickins, Andrew Morton, LKML, Ingo Molnar

On Tue, 20 Sep 2005, Blaisorblade wrote:

> And the RLIMIT_RSS is totally unused - I bet Rik's patch didn't manage 
> to go in, or it's me missing something?

It never went into the kernel, because Andrew did not find
a workload where the code provided benefits.  Sounds fair
to me ;)

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Remap_file_pages, RSS limits, security implications (was: Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3)
  2005-09-20 15:06       ` Remap_file_pages, RSS limits, security implications (was: Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3) Blaisorblade
  2005-09-20 18:23         ` Rik van Riel
@ 2005-09-21 15:16         ` Hugh Dickins
  2005-09-21 16:16           ` Blaisorblade
  1 sibling, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2005-09-21 15:16 UTC (permalink / raw)
  To: Blaisorblade; +Cc: Rik van Riel, Andrew Morton, LKML, Ingo Molnar

On Tue, 20 Sep 2005, Blaisorblade wrote:
> On Wednesday 07 September 2005 14:00, Hugh Dickins wrote:
> 
> Ahh, ok... VM_MAYSHARE is the recorded MAP_SHARED, while VM_SHARED says 
> whether the pages are actually shared and writable.

That's it (let's say "potentially writable" - VM_SHARED says they're
shared and already writable or shared and could be mprotected writable).

It is confusing, but it's been that way for years, and once you grasp it,
you like to keep it that way, so as to feel superior to everyone else ;)

> > Though thinking through that again now, the user of the nonlinear vma
> > is penalized,
> 
> Where? Not in the page fault path.... it's as penalized as the rest of the 
> system. Or will direct reclaim have a preference for pages of the calling 
> process?

Not in the page fault path, no; and no, direct reclaim doesn't have that
preference (sounds quite a good idea, but would need a different way of
choosing pages to free, and would conflict with the swap token idea?).

I meant in reclaim: that since try_to_unmap_cluster makes such poor
choices of what to reclaim, and tries to reclaim more than asked because
it's unable to target the page in question, I'd expect the user of a
nonlinear vma to suffer more thrashing under memory pressure.

> So, it would really be better to actually enforce the RSS rlimit when mapping 
> in pages in *nonlinear* areas (and fallback on setting file PTE's like on 
> NONBLOCK & page not in cache), rather than the "current" Rik's idea of 
> marking pages as inactive on memory-hog processes.

I'd go mad if I delved into these issues, luckily we have suckers like Rik
and Nick and Con who are prepared to give their lives to such endless tuning.

> But oh, right in mm/trash.c, the code which should do part of this is fully 
> commented out - and it was in the very first version of the code (looking 
> through bkcvs-git repository).

mm/trash.c?  I got quite excited, but now it looks like you just mean
mm/thrash.c.  Yawn.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Remap_file_pages, RSS limits, security implications (was: Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3)
  2005-09-21 15:16         ` Hugh Dickins
@ 2005-09-21 16:16           ` Blaisorblade
  2005-09-21 16:50             ` Hugh Dickins
  0 siblings, 1 reply; 16+ messages in thread
From: Blaisorblade @ 2005-09-21 16:16 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Rik van Riel, Andrew Morton, LKML, Ingo Molnar

On Wednesday 21 September 2005 17:16, Hugh Dickins wrote:
> On Tue, 20 Sep 2005, Blaisorblade wrote:
> > On Wednesday 07 September 2005 14:00, Hugh Dickins wrote:

> > Ahh, ok... VM_MAYSHARE is the recorded MAP_SHARED, while VM_SHARED says
> > whether the pages are actually shared and writable.

> That's it (let's say "potentially writable" - VM_SHARED says they're
> shared and already writable or shared and could be mprotected writable).

> It is confusing, but it's been that way for years, and once you grasp it,
> you like to keep it that way, so as to feel superior to everyone else ;)

> > > Though thinking through that again now, the user of the nonlinear vma
> > > is penalized,

> > Where? Not in the page fault path.... it's as penalized as the rest of
> > the system. Or will direct reclaim have a preference for pages of the
> > calling process?

> Not in the page fault path, no; and no, direct reclaim doesn't have that
> preference (sounds quite a good idea, but would need a different way of
> choosing pages to free, and would conflict with the swap token idea?).

> I meant in reclaim: that since try_to_unmap_cluster makes such poor
> choices of what to reclaim, and tries to reclaim more than asked because
> it's unable to target the page in question, I'd expect the user of a
> nonlinear vma to suffer more thrashing under memory pressure.

Other pages in the VMA may be unmapped, yes, but not freed. In fact, they're 
kept in by the pagecache reference; try_to_unmap() (or better its caller, 
shrink_list) will only actually free the page it asked for.

The only real "problem" is that we do ptep_clear_flush_young without 
activating the page. And yes, *this* may penalize who holds a nonlinear VMA. 
But this is probably fair, given that we're going to have trouble in freeing 
those pages.

> > So, it would really be better to actually enforce the RSS rlimit when
> > mapping in pages in *nonlinear* areas (and fallback on setting file PTE's
> > like on NONBLOCK & page not in cache), rather than the "current" Rik's
> > idea of marking pages as inactive on memory-hog processes.

> I'd go mad if I delved into these issues, luckily we have suckers like Rik
> and Nick and Con who are prepared to give their lives to such endless
> tuning.

> > But oh, right in mm/trash.c, the code which should do part of this is
> > fully commented out - and it was in the very first version of the code
> > (looking through bkcvs-git repository).

> mm/trash.c?  I got quite excited,
What would that have meant?
> but now it looks like you just mean 
> mm/thrash.c.
Yes.
> Yawn. 

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Remap_file_pages, RSS limits, security implications (was: Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3)
  2005-09-21 16:16           ` Blaisorblade
@ 2005-09-21 16:50             ` Hugh Dickins
  2005-09-21 17:02               ` Blaisorblade
  0 siblings, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2005-09-21 16:50 UTC (permalink / raw)
  To: Blaisorblade; +Cc: Rik van Riel, Andrew Morton, LKML, Ingo Molnar

On Wed, 21 Sep 2005, Blaisorblade wrote:
> 
> Other pages in the VMA may be unmapped, yes, but not freed. In fact, they're 
> kept in by the pagecache reference; try_to_unmap() (or better its caller, 
> shrink_list) will only actually free the page it asked for.

Not freed in that pass, yes; but brought closer to being freed soon.

> The only real "problem" is that we do ptep_clear_flush_young without 
> activating the page. And yes, *this* may penalize who holds a nonlinear VMA. 
> But this is probably fair, given that we're going to have trouble in freeing 
> those pages.

Good point, I don't remember ever considering that.
But agree it should work out fairly.

> > mm/trash.c?  I got quite excited,
> What would that have meant?

Trash is rubbish or garbage.  Or if I trash my hotel room (not me!),
I'd rip the washbasin off the wall, smash the mirror, throw the
chair through the window, ... hmm, better stop this public fantasy.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Remap_file_pages, RSS limits, security implications (was: Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3)
  2005-09-21 16:50             ` Hugh Dickins
@ 2005-09-21 17:02               ` Blaisorblade
  0 siblings, 0 replies; 16+ messages in thread
From: Blaisorblade @ 2005-09-21 17:02 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Rik van Riel, Andrew Morton, LKML, Ingo Molnar

On Wednesday 21 September 2005 18:50, Hugh Dickins wrote:
> On Wed, 21 Sep 2005, Blaisorblade wrote:
> > Other pages in the VMA may be unmapped, yes, but not freed. In fact,
> > they're kept in by the pagecache reference; try_to_unmap() (or better its
> > caller, shrink_list) will only actually free the page it asked for.

> Not freed in that pass, yes; but brought closer to being freed soon.

Will a page with mapcount == 0 be put in the inactive list explicitly? At next 
scan PageActive will be clear, sure, and it won't be reactivated while it's 
unmapped.

But references will only cause more hardware faults.
> > The only real "problem" is that we do ptep_clear_flush_young without
> > activating the page. And yes, *this* may penalize who holds a nonlinear
> > VMA. But this is probably fair, given that we're going to have trouble in
> > freeing those pages.

> Good point, I don't remember ever considering that.
> But agree it should work out fairly.

> > > mm/trash.c?  I got quite excited,

> > What would that have meant?

> Trash is rubbish or garbage.  Or if I trash my hotel room (not me!),
> I'd rip the washbasin off the wall, smash the mirror, throw the
> chair through the window, ... hmm, better stop this public fantasy.

Nice... hope this can get to LWN "quotes of the week" page.

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3
  2005-09-07 12:00     ` Hugh Dickins
  2005-09-13 18:25       ` Blaisorblade
  2005-09-20 15:06       ` Remap_file_pages, RSS limits, security implications (was: Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3) Blaisorblade
@ 2005-09-26 15:58       ` Blaisorblade
  2005-09-28 13:37         ` Hugh Dickins
  2 siblings, 1 reply; 16+ messages in thread
From: Blaisorblade @ 2005-09-26 15:58 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: LKML, Ulrich Drepper

On Wednesday 07 September 2005 14:00, Hugh Dickins wrote:
> On Sun, 4 Sep 2005, Blaisorblade wrote:
> > > This seems reasonable to me.  I was afraid you were going to extend
> > > VM_NONLINEAR to private maps, but no, you're just letting private maps
> > > be populated linearly, that's fine.

> > Would that be a real problem, when limited to readonly mappings?
> > There's some interest in that, for library loading - a binary with 100
> > dso's has 300 vmas...

> You seem to see vmas as nothing but obstacles.
> Perhaps there is a reason 
> we divide the mm into vmas?
Let's not exaggerate, I've not suggested dropping VMA entirely.

It's just that UML is a workload suffering for the tree lookup, and Ulrich 
Drepper found another case where the tree lookup was a problem.

I also thought to the lookup function in itself, but I've no easy road to go 
(I've been thinking to radix trees, but they can be way worse on 64bit with 
many little mappings) and that is even more invasive (though it would 
theoretically be a cleaner solution).

> But perhaps there is a better way of doing it. 
For instance, replacing PROT_NONE vmas, which are inserted just to prevent 
mappings in those areas (which must be instead left as guard pages), with 
simple PROT_NONE ptes, is fairly simple.

[...]
> I think I get the picture now: you start with a dark sky, then insert the
> stars one by one.  Yes, I accept what I proposed would be of no use to you.

> Okay, so as well as the VM_FAULT_SIGSEGV case, you do need to add that
> preliminary check to each arch.

> So far as I can see (I may have missed it), you really don't need to
> change from the write boolean

> (perhaps -1 for exec in one arch??)
? Not understood this part, ignoring it?
Maybe you mean "except one arch, x86_64, which supports exec protection?"
> to the  
> write_access mask: it's not a bad change, but the less you have to modify
> each of the architectures, the better for now.
Yes, I see your point. Using VM_READ, especially, is often unneeded (I think 
we don't do "executable, not readable" or "writable, not readable"). I've now 
simplified the handling for the arches, allowing them to be changed much 
less.

However, I think it's better to leave the generic code as-is, while extending 
the wrapper, to pass VM_READ too. So, only archs wanting to do execute 
protection, or to support strange things as "write allowed, read disallowed", 
will need to use access_mask.
> > That check may be moved later, to beginning of bad_area, if you say
> > "default prots are the minimum one, I can only remap with more permissive
> > settings". That would avoid affecting the fast path - even if the branch
> > is clearly an "unlikely" one, so shouldn't give mispredictions at least.

> Would it?  Anyway, although it fits with UML's particular usage (starting
> from PROT_NONE), your minimum permissions idea seems neither natural nor
> useful to me (and I don't think you're convinced by it either, just trying
> to offer a constructive alternative).

Exactly... I don't like it, even because it would disallow using 
r_f_p(PROT_NONE) in place of PROT_NONE vma for guard pages.

> No, just stick with the "unlikely" test in the obvious place where you
> already put it, and leave it for later optimization to rearrange that if
> necessary: perhaps Christoph L will want modify the ia64 path, to move
> that test down into the bad_area handling, to jump back if manyprots;
That is not an optimization, but a breakage, for the case where I have (say) a 
readable mapping, but a PROT_NONE page in it. I know because I've hit this 
problem with my testprogram.

Unless you use the "minimum prots" ideas, or there is something I'm missing in 
ia64 (it seems not - and IIRC it supports branch prediction hints from the 
compiler).
> but no need for spaghetti in the initial implementation.

> > > Sorry, I do not know when I can next face going over this,
> > > it's a hard task for me: perhaps someone else can take over.

> I get very sick of holding up 
> other people's requirements, but yours is not the first in line.
Don't worry for that, and I really appreciate your review on my first VM work. 
Thanks a lot!
> Hugh

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade


	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3
  2005-09-26 15:58       ` [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3 Blaisorblade
@ 2005-09-28 13:37         ` Hugh Dickins
  2005-09-28 16:20           ` Blaisorblade
  0 siblings, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2005-09-28 13:37 UTC (permalink / raw)
  To: Blaisorblade; +Cc: LKML, Ulrich Drepper

Sorry, it's really hard to read your interspersed comments.  Perhaps I
need to switch on some colour option when reading your mails, but I've
never found the need for it before.  Please, use a blank line above
and below your comments to help us locate them and read them, thanks.

On Mon, 26 Sep 2005, Blaisorblade wrote:
> On Wednesday 07 September 2005 14:00, Hugh Dickins wrote:
> 
> > So far as I can see (I may have missed it), you really don't need to
> > change from the write boolean
> 
> > (perhaps -1 for exec in one arch??)
> ? Not understood this part, ignoring it?
> Maybe you mean "except one arch, x86_64, which supports exec protection?"

No, I meant the current code uses "0" for read fault, "1" for write fault,
and (in a quick search) only found one architecture (I forget which,
certainly not x86_64) which might have been interested to pass down
a different value to handle_mm_fault to distinguish execution fault:
for which I was suggesting to use "-1", rather than change everywhere.
Though now I'm doubting there was any such case at all.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3
  2005-09-28 13:37         ` Hugh Dickins
@ 2005-09-28 16:20           ` Blaisorblade
  0 siblings, 0 replies; 16+ messages in thread
From: Blaisorblade @ 2005-09-28 16:20 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: LKML

On Wednesday 28 September 2005 15:37, Hugh Dickins wrote:
> Sorry, it's really hard to read your interspersed comments.  Perhaps I
> need to switch on some colour option when reading your mails, but I've
> never found the need for it before.  Please, use a blank line above
> and below your comments to help us locate them and read them, thanks.

Ok, will do.

> On Mon, 26 Sep 2005, Blaisorblade wrote:
> > On Wednesday 07 September 2005 14:00, Hugh Dickins wrote:
> > > So far as I can see (I may have missed it), you really don't need to
> > > change from the write boolean
> > >
> > > (perhaps -1 for exec in one arch??)
> >
> > ? Not understood this part, ignoring it?
> > Maybe you mean "except one arch, x86_64, which supports exec protection?"
>
> No, I meant the current code uses "0" for read fault, "1" for write fault,
> and (in a quick search) only found one architecture (I forget which,
> certainly not x86_64) which might have been interested to pass down
> a different value to handle_mm_fault to distinguish execution fault:
> for which I was suggesting to use "-1", rather than change everywhere.

In my local tree I've restricted the changes to generic arch code, and to 
x86_64 which uses the __handle_mm_fault with VM_EXEC if needed. The rest keep 
using handle_mm_fault (there's still the check for VM_MANYPROTS).

> Though now I'm doubting there was any such case at all.

> Hugh

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2005-09-28 16:21 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-08-26 18:23 [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3 Blaisorblade
2005-08-26 19:11 ` Hugh Dickins
2005-08-26 19:58   ` [uml-devel] " Blaisorblade
2005-09-02 21:02 ` Hugh Dickins
2005-09-04 19:10   ` [uml-devel] " Blaisorblade
2005-09-07 12:00     ` Hugh Dickins
2005-09-13 18:25       ` Blaisorblade
2005-09-20 15:06       ` Remap_file_pages, RSS limits, security implications (was: Re: [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3) Blaisorblade
2005-09-20 18:23         ` Rik van Riel
2005-09-21 15:16         ` Hugh Dickins
2005-09-21 16:16           ` Blaisorblade
2005-09-21 16:50             ` Hugh Dickins
2005-09-21 17:02               ` Blaisorblade
2005-09-26 15:58       ` [uml-devel] Re: [RFC] [patch 0/18] remap_file_pages protection support (for UML), try 3 Blaisorblade
2005-09-28 13:37         ` Hugh Dickins
2005-09-28 16:20           ` Blaisorblade

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).