Fix mlockall for PROT_NONE mappings
diff mbox series

Message ID 20031009104218.GA1935@averell
State New, archived
Headers show
Series
  • Fix mlockall for PROT_NONE mappings
Related show

Commit Message

Andi Kleen Oct. 9, 2003, 10:42 a.m. UTC
The x86-64 ld.so always puts a PROT_NONE mapping into every 64bit process'
address space.

2a95791000-2a9586d000 ---p 00124000 03:01 788090                         /lib64/libc.so.6

This broke mlockall on x86-64 which ran into this mapping and always
returned an error and stopped early before mlocking everything.

This patch fixes mlockall to ignore errors for such mappings.

I added a new argument force==2 to get_user_pages that means to ignore
SIGBUS or unaccessible pages errors. MAY_* is still checked for like
with the old force ==1, it just doesn't error out now for SIGBUS
errors on handle_mm_fault. 

make_pages_present also has a new "tolerant" argument now that sets
force == 2. The patch sets it for mlockall and mmap (which didn't 
check its return arguments), but not for mlock(). Arguably it could
be not set for mmap on a VM_LOCKED vma.

-Andi


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Comments

Muli Ben-Yehuda Oct. 9, 2003, 10:49 a.m. UTC | #1
On Thu, Oct 09, 2003 at 12:42:18PM +0200, Andi Kleen wrote:

> I added a new argument force==2 to get_user_pages that means to ignore
> SIGBUS or unaccessible pages errors. MAY_* is still checked for like
> with the old force ==1, it just doesn't error out now for SIGBUS
> errors on handle_mm_fault. 

How about making it an enum or define for code readability? I'd much
rather see an IGNORE_BAD_PAGES or some such than a cryptic '2' in the
code. I can send a patch to that effect if you'd like? 
 
Cheers, 
Muli
Andi Kleen Oct. 9, 2003, 11:22 a.m. UTC | #2
On Thu, Oct 09, 2003 at 12:49:18PM +0200, Muli Ben-Yehuda wrote:
> On Thu, Oct 09, 2003 at 12:42:18PM +0200, Andi Kleen wrote:
> 
> > I added a new argument force==2 to get_user_pages that means to ignore
> > SIGBUS or unaccessible pages errors. MAY_* is still checked for like
> > with the old force ==1, it just doesn't error out now for SIGBUS
> > errors on handle_mm_fault. 
> 
> How about making it an enum or define for code readability? I'd much
> rather see an IGNORE_BAD_PAGES or some such than a cryptic '2' in the
> code. I can send a patch to that effect if you'd like? 

Doesn't look essential. You could submit it as a follow-up patch as soon
as Linus merged this version, but I'm not sure it satisfies the current
"no more cleanups" rule because it isn't a bugfix.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Muli Ben-Yehuda Oct. 9, 2003, 11:24 a.m. UTC | #3
On Thu, Oct 09, 2003 at 01:22:45PM +0200, Andi Kleen wrote:
> On Thu, Oct 09, 2003 at 12:49:18PM +0200, Muli Ben-Yehuda wrote:
> > On Thu, Oct 09, 2003 at 12:42:18PM +0200, Andi Kleen wrote:
> > 
> > > I added a new argument force==2 to get_user_pages that means to ignore
> > > SIGBUS or unaccessible pages errors. MAY_* is still checked for like
> > > with the old force ==1, it just doesn't error out now for SIGBUS
> > > errors on handle_mm_fault. 
> > 
> > How about making it an enum or define for code readability? I'd much
> > rather see an IGNORE_BAD_PAGES or some such than a cryptic '2' in the
> > code. I can send a patch to that effect if you'd like? 
> 
> Doesn't look essential. You could submit it as a follow-up patch as soon
> as Linus merged this version, but I'm not sure it satisfies the current
> "no more cleanups" rule because it isn't a bugfix.

That's exactly the reason I wrote to you in the first place - if your
patch gets merged *with* the readability cleanups, it is irrelevant
whether the a follow up patch is acceptible or not.  

I agree that improved readability is not "essential". Do you agree
that it's preferable?
Andi Kleen Oct. 9, 2003, 11:31 a.m. UTC | #4
> I agree that improved readability is not "essential". Do you agree
> that it's preferable?  

I don't think this version is particularly unreadable or that your
change would improve it very much.
(to make any sense of a function with 6+ arguments you have to look
at the function definition, no way around that, and there is a comment
that explains it there), so I don't think the change is particularly
important.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Linus Torvalds Oct. 9, 2003, 2:44 p.m. UTC | #5
This is too ugly for words.

Just make mlockall() ignore any mappings that are not readable.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Andi Kleen Oct. 9, 2003, 2:52 p.m. UTC | #6
On Thu, Oct 09, 2003 at 07:44:08AM -0700, Linus Torvalds wrote:
> 
> This is too ugly for words.
> 
> Just make mlockall() ignore any mappings that are not readable.

That is exactly what the patch is doing.

It has just to pass this information down to get_user_pages to make mlock() 
not ignore them too.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Linus Torvalds Oct. 9, 2003, 2:56 p.m. UTC | #7
On 9 Oct 2003, Andi Kleen wrote:
> 
> That is exactly what the patch is doing.

No it's not.

What I'm asking for is a simple

	if (vma->vm_flags & VM_READ)
		make_pages_readable();

kind of thing. A couple of one-liners in the _callers_, not a horribly 
ugly change way down the stack.

Flags are ugly. Multi-value flags that have magic constants are worse. 
This patch deserves to die.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Andi Kleen Oct. 9, 2003, 3:12 p.m. UTC | #8
On Thu, Oct 09, 2003 at 07:56:07AM -0700, Linus Torvalds wrote:
> 
> On 9 Oct 2003, Andi Kleen wrote:
> > 
> > That is exactly what the patch is doing.
> 
> No it's not.
> 
> What I'm asking for is a simple
> 
> 	if (vma->vm_flags & VM_READ)
> 		make_pages_readable();
> 
> kind of thing. A couple of one-liners in the _callers_, not a horribly 
> ugly change way down the stack.

Ok. But what is with mappings that have MAY_READ not set ?
[not 100% this cannot happen]

Without the changes make_pages_present didn't call get_user_pages
with the "force" argument. And changing it unconditionally would
change the behaviour for everybody.

Also there is still the problem that it will fail early for SIGBUS
that happens for other reasons (e.g. a hardware driver not being
able to mmap everything) 

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Linus Torvalds Oct. 9, 2003, 3:17 p.m. UTC | #9
On 9 Oct 2003, Andi Kleen wrote:
> 
> Ok. But what is with mappings that have MAY_READ not set ?
> [not 100% this cannot happen]

Well, PROT_NONE will not have MAY_READ set, and no, they won't get locked 
into memory.

But neither did they get locked in before either, so it's not like it's a 
new issue. Now we'd just ignore them nicely rather than abort the 
lockall(). Hmm?

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Andi Kleen Oct. 9, 2003, 3:33 p.m. UTC | #10
On Thu, Oct 09, 2003 at 08:17:48AM -0700, Linus Torvalds wrote:
> 
> On 9 Oct 2003, Andi Kleen wrote:
> > 
> > Ok. But what is with mappings that have MAY_READ not set ?
> > [not 100% this cannot happen]
> 
> Well, PROT_NONE will not have MAY_READ set, and no, they won't get locked 
> into memory.
> 
> But neither did they get locked in before either, so it's not like it's a 
> new issue. Now we'd just ignore them nicely rather than abort the 
> lockall(). Hmm?

Ok, here's a new version.

-------------------------------------------------------------------------

Fix mlockall when there are unreadable mappings in the address space
(happens frequently on x86-64 where ld.so creates an unreadable mapping
for libc). Ignores mappings that don't have MAY_READ set in the high level 
mlockall function.

As an improvement it continues mlockall now for all mappings even when an 
error occurs. The error is still reported, but only the first one.

This should do the right thing for applications that do mlockall(MCL_CURRENT)
but don't want to worry about some weird mappings e.g. from DRI in their
address space.


diff -u linux-test7-work/mm/mlock.c-o linux-test7-work/mm/mlock.c
--- linux-test7-work/mm/mlock.c-o	2003-09-28 10:53:25.000000000 +0200
+++ linux-test7-work/mm/mlock.c	2003-12-04 16:33:58.000000000 +0100
@@ -135,7 +135,7 @@
 
 static int do_mlockall(int flags)
 {
-	int error;
+	int error, nerr;
 	unsigned int def_flags;
 	struct vm_area_struct * vma;
 
@@ -154,9 +154,15 @@
 		newflags = vma->vm_flags | VM_LOCKED;
 		if (!(flags & MCL_CURRENT))
 			newflags &= ~VM_LOCKED;
-		error = mlock_fixup(vma, vma->vm_start, vma->vm_end, newflags);
-		if (error)
-			break;
+
+		if (!(newflags & VM_READ))
+			continue;
+
+	        nerr = mlock_fixup(vma, vma->vm_start, vma->vm_end, newflags);
+		/* When an error occurs continue with the next mapping, 
+		   but report only the first one */
+		if (nerr && !error)
+			error = nerr;
 	}
 	return error;
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Linus Torvalds Oct. 9, 2003, 3:40 p.m. UTC | #11
On Thu, 9 Oct 2003, Andi Kleen wrote:
> 
> Ok, here's a new version.

Still won't work.

You need to call mlock_fixup() to split the vma properly (it might not 
cover the whole vma, and we _do_ want to keep track of the VM_LOCKED flag 
correctly.

I _think_ the proper fix is something trivial like this, but it's 
obviously untested. Please verify

		Linus

----
--- 1.6/mm/mlock.c	Sun Sep 21 14:50:36 2003
+++ edited/mm/mlock.c	Thu Oct  9 08:39:25 2003
@@ -43,7 +43,8 @@
 	pages = (end - start) >> PAGE_SHIFT;
 	if (newflags & VM_LOCKED) {
 		pages = -pages;
-		ret = make_pages_present(start, end);
+		if (newflags & VM_READ)
+			ret = make_pages_present(start, end);
 	}
 
 	vma->vm_mm->locked_vm -= pages;

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Andi Kleen Oct. 9, 2003, 4:34 p.m. UTC | #12
On Thu, Oct 09, 2003 at 08:40:09AM -0700, Linus Torvalds wrote:
> 
> On Thu, 9 Oct 2003, Andi Kleen wrote:
> > 
> > Ok, here's a new version.
> 
> Still won't work.
> 
> You need to call mlock_fixup() to split the vma properly (it might not 
> cover the whole vma, and we _do_ want to keep track of the VM_LOCKED flag 
> correctly.

I don't think so. mlockall() should never split anything.

mlock may, but I didn't change its behaviour. 

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Linus Torvalds Oct. 9, 2003, 5:28 p.m. UTC | #13
On 9 Oct 2003, Andi Kleen wrote:
> 
> I don't think so. mlockall() should never split anything.

Mut mlock_fixup() _also_ does:
 - set the VM_LOCKED bit
 - do page locked accounting.

So if you don't call it, you may have interesting problems later on.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Patch
diff mbox series

diff -u linux-test7-work/mm/memory.c-o linux-test7-work/mm/memory.c
--- linux-test7-work/mm/memory.c-o	2003-10-09 00:29:01.000000000 +0200
+++ linux-test7-work/mm/memory.c	2003-12-04 11:27:43.000000000 +0100
@@ -674,12 +674,15 @@ 
 
 static inline struct page *get_page_map(struct page *page)
 {
+	if (!page)
+		return 0;
 	if (!pfn_valid(page_to_pfn(page)))
 		return 0;
 	return page;
 }
 
 
+/* force == 2 means tolerate holes in the mapping */
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, int len, int write, int force,
 		struct page **pages, struct vm_area_struct **vmas)
@@ -763,6 +766,10 @@ 
 					tsk->maj_flt++;
 					break;
 				case VM_FAULT_SIGBUS:
+					if (force == 2) { 
+						map = NULL;
+						break;
+					}
 					return i ? i : -EFAULT;
 				case VM_FAULT_OOM:
 					return i ? i : -ENOMEM;
@@ -770,19 +777,22 @@ 
 					BUG();
 				}
 				spin_lock(&mm->page_table_lock);
+				if (!map) 
+					break; 
 			}
 			if (pages) {
 				pages[i] = get_page_map(map);
-				if (!pages[i]) {
+				if (!pages[i] && force != 2) {
 					spin_unlock(&mm->page_table_lock);
 					while (i--)
 						page_cache_release(pages[i]);
 					i = -EFAULT;
 					goto out;
+				} else { 
+					flush_dcache_page(pages[i]);
+					if (!PageReserved(pages[i]))
+						page_cache_get(pages[i]);
 				}
-				flush_dcache_page(pages[i]);
-				if (!PageReserved(pages[i]))
-					page_cache_get(pages[i]);
 			}
 			if (vmas)
 				vmas[i] = vma;
@@ -1655,7 +1665,7 @@ 
 	return pmd_offset(pgd, address);
 }
 
-int make_pages_present(unsigned long addr, unsigned long end)
+int make_pages_present(unsigned long addr, unsigned long end, int tolerant)
 {
 	int ret, len, write;
 	struct vm_area_struct * vma;
@@ -1668,7 +1678,7 @@ 
 		BUG();
 	len = (end+PAGE_SIZE-1)/PAGE_SIZE-addr/PAGE_SIZE;
 	ret = get_user_pages(current, current->mm, addr,
-			len, write, 0, NULL, NULL);
+			len, write, tolerant ? 2 : 0, NULL, NULL);
 	if (ret < 0)
 		return ret;
 	return ret == len ? 0 : -1;
diff -u linux-test7-work/mm/mlock.c-o linux-test7-work/mm/mlock.c
--- linux-test7-work/mm/mlock.c-o	2003-09-28 10:53:25.000000000 +0200
+++ linux-test7-work/mm/mlock.c	2003-12-04 10:58:22.000000000 +0100
@@ -10,7 +10,8 @@ 
 
 
 static int mlock_fixup(struct vm_area_struct * vma, 
-	unsigned long start, unsigned long end, unsigned int newflags)
+	unsigned long start, unsigned long end, unsigned int newflags,
+	int tolerant)
 {
 	struct mm_struct * mm = vma->vm_mm;
 	int pages;
@@ -43,7 +44,7 @@ 
 	pages = (end - start) >> PAGE_SHIFT;
 	if (newflags & VM_LOCKED) {
 		pages = -pages;
-		ret = make_pages_present(start, end);
+		ret = make_pages_present(start, end, tolerant);
 	}
 
 	vma->vm_mm->locked_vm -= pages;
@@ -79,13 +80,13 @@ 
 			newflags &= ~VM_LOCKED;
 
 		if (vma->vm_end >= end) {
-			error = mlock_fixup(vma, nstart, end, newflags);
+			error = mlock_fixup(vma, nstart, end, newflags, 0);
 			break;
 		}
 
 		tmp = vma->vm_end;
 		next = vma->vm_next;
-		error = mlock_fixup(vma, nstart, tmp, newflags);
+		error = mlock_fixup(vma, nstart, tmp, newflags, 0);
 		if (error)
 			break;
 		nstart = tmp;
@@ -154,7 +155,7 @@ 
 		newflags = vma->vm_flags | VM_LOCKED;
 		if (!(flags & MCL_CURRENT))
 			newflags &= ~VM_LOCKED;
-		error = mlock_fixup(vma, vma->vm_start, vma->vm_end, newflags);
+		error = mlock_fixup(vma, vma->vm_start, vma->vm_end, newflags, 1);
 		if (error)
 			break;
 	}
diff -u linux-test7-work/mm/mremap.c-o linux-test7-work/mm/mremap.c
--- linux-test7-work/mm/mremap.c-o	2003-09-11 04:12:42.000000000 +0200
+++ linux-test7-work/mm/mremap.c	2003-12-04 11:00:25.000000000 +0100
@@ -281,7 +281,7 @@ 
 			current->mm->locked_vm += new_len >> PAGE_SHIFT;
 			if (new_len > old_len)
 				make_pages_present(new_addr + old_len,
-						   new_addr + new_len);
+						   new_addr + new_len, 2);
 		}
 		return new_addr;
 	}
@@ -405,7 +405,7 @@ 
 			if (vma->vm_flags & VM_LOCKED) {
 				current->mm->locked_vm += pages;
 				make_pages_present(addr + old_len,
-						   addr + new_len);
+						   addr + new_len, 2);
 			}
 			ret = addr;
 			goto out;
diff -u linux-test7-work/mm/mmap.c-o linux-test7-work/mm/mmap.c
--- linux-test7-work/mm/mmap.c-o	2003-10-09 00:29:01.000000000 +0200
+++ linux-test7-work/mm/mmap.c	2003-12-04 11:02:17.000000000 +0100
@@ -655,7 +655,7 @@ 
 	mm->total_vm += len >> PAGE_SHIFT;
 	if (vm_flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
-		make_pages_present(addr, addr + len);
+		make_pages_present(addr, addr + len, 2);
 	}
 	if (flags & MAP_POPULATE) {
 		up_write(&mm->mmap_sem);
@@ -910,7 +910,7 @@ 
 	if (!prev || expand_stack(prev, addr))
 		return NULL;
 	if (prev->vm_flags & VM_LOCKED) {
-		make_pages_present(addr, prev->vm_end);
+		make_pages_present(addr, prev->vm_end, 2);
 	}
 	return prev;
 }
@@ -971,7 +971,7 @@ 
 	if (expand_stack(vma, addr))
 		return NULL;
 	if (vma->vm_flags & VM_LOCKED) {
-		make_pages_present(addr, start);
+		make_pages_present(addr, start, 2);
 	}
 	return vma;
 }
@@ -1345,7 +1345,7 @@ 
 	mm->total_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
-		make_pages_present(addr, addr + len);
+		make_pages_present(addr, addr + len, 2);
 	}
 	return addr;
 }
diff -u linux-test7-work/include/linux/mm.h-o linux-test7-work/include/linux/mm.h
--- linux-test7-work/include/linux/mm.h-o	2003-10-09 00:29:00.000000000 +0200
+++ linux-test7-work/include/linux/mm.h	2003-12-04 11:00:25.000000000 +0100
@@ -433,7 +433,7 @@ 
 extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot);
 extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot);
 extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access);
-extern int make_pages_present(unsigned long addr, unsigned long end);
+extern int make_pages_present(unsigned long addr, unsigned long end, int tolerant);
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
 extern long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long nonblock);
 extern long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);