All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
       [not found]         ` <CA+55aFyzy_wYHHnr2gDcYr7qcgOKM2557bRdg6RBa=cxrynd+Q@mail.gmail.com>
@ 2015-01-27 20:57           ` Linus Torvalds
       [not found]             ` <CA+55aFxRnj97rpSQvvzLJhpo7C8TQ-F=eB1Ry2n53AV1rN8mwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-01-29  6:59             ` Max Filippov
       [not found]           ` <CA+55aFyzy_wYHHnr2gDcYr7qcgOKM2557bRdg6RBa=cxrynd+Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-01-27 20:57 UTC (permalink / raw)
  To: Takashi Iwai, linux-arch; +Cc: opensuse-factory, OpenSUSE Kernel Team

[-- Attachment #1: Type: text/plain, Size: 1654 bytes --]

[ Adding 'linux-arch' to the recipients, since this touches pretty
much all architectures ]

Background for arch people: it seems that a few applications really
care about the difference between SIGSEGV and SIGBUS, but the generic
VM layer currently has no way to say "this access should generate a
SIGSEGV". We have that VM_FAULT_SIGBUS, but no equivalent
VM_FAULT_SIGSEGV.

So when the stack limit fix went in, I used VM_FAULT_SIGBUS, and a
couple of apps noticed that the stack rlimit violation changed from
SIGSEGV to SIGBUS as a result.

It's actually sad that this whole error handling is duplicated all
over every architecture, but oh well. This is a completely mindless
patch to add VM_FAULT_SIGSEGV.

Some architectures aren't affected, for the simple reason that they
already ended up returning SIGSEGV for non-SIGBUS errors. Most other
architectures had a BUG_ON() for the unrecognized case, and just need
a trivial "if (fault & VM_FAULT_SIGSEGV) goto bad_area;"

And then some architectures had a different pattern, and I tried to
fix it up as straightforwardly as possible, but I could easily have
screwed up.

Can people take a look?

On Tue, Jan 27, 2015 at 12:36 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Very annoying. The patch would look something like the attached -
> TOTALLY UNTESTED.

Actually, I missed a couple of places in mm/gup.c and mm/ksm.c (and
one in lustre, although that one just uses filemap_fault, so it never
triggers the stack case, but for completeness).

So this would be the more complete patch. Still totally untested. I
may have screwed up something obvious.

                         Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 15607 bytes --]

 arch/alpha/mm/fault.c                        | 2 ++
 arch/arc/mm/fault.c                          | 2 ++
 arch/avr32/mm/fault.c                        | 2 ++
 arch/cris/mm/fault.c                         | 2 ++
 arch/frv/mm/fault.c                          | 2 ++
 arch/ia64/mm/fault.c                         | 2 ++
 arch/m32r/mm/fault.c                         | 2 ++
 arch/m68k/mm/fault.c                         | 2 ++
 arch/metag/mm/fault.c                        | 2 ++
 arch/microblaze/mm/fault.c                   | 2 ++
 arch/mips/mm/fault.c                         | 2 ++
 arch/mn10300/mm/fault.c                      | 2 ++
 arch/nios2/mm/fault.c                        | 2 ++
 arch/openrisc/mm/fault.c                     | 2 ++
 arch/parisc/mm/fault.c                       | 2 ++
 arch/powerpc/mm/copro_fault.c                | 3 +++
 arch/powerpc/mm/fault.c                      | 2 ++
 arch/s390/mm/fault.c                         | 6 ++++++
 arch/score/mm/fault.c                        | 2 ++
 arch/sh/mm/fault.c                           | 2 ++
 arch/sparc/mm/fault_32.c                     | 2 ++
 arch/sparc/mm/fault_64.c                     | 2 ++
 arch/tile/mm/fault.c                         | 2 ++
 arch/um/kernel/trap.c                        | 2 ++
 arch/x86/mm/fault.c                          | 2 ++
 arch/xtensa/mm/fault.c                       | 2 ++
 drivers/staging/lustre/lustre/llite/vvp_io.c | 2 +-
 include/linux/mm.h                           | 6 ++++--
 mm/gup.c                                     | 4 ++--
 mm/ksm.c                                     | 2 +-
 30 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 98838a05ba6d..9d0ac091a52a 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -156,6 +156,8 @@ retry:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 6f7e3a68803a..0f8df3b5b1b3 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -161,6 +161,8 @@ good_area:
 
 	if (fault & VM_FAULT_OOM)
 		goto out_of_memory;
+	else if (fault & VM_FAULT_SIGSEV)
+		goto bad_area;
 	else if (fault & VM_FAULT_SIGBUS)
 		goto do_sigbus;
 
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index 0eca93327195..d223a8b57c1e 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -142,6 +142,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 1790f22e71a2..2686a7aa8ec8 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -176,6 +176,8 @@ retry:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index 9a66372fc7c7..ec4917ddf678 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -168,6 +168,8 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 7225dad87094..ba5ba7accd0d 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -172,6 +172,8 @@ retry:
 		 */
 		if (fault & VM_FAULT_OOM) {
 			goto out_of_memory;
+		} else if (fault & VM_FAULT_SIGSEGV) {
+			goto bad_area;
 		} else if (fault & VM_FAULT_SIGBUS) {
 			signal = SIGBUS;
 			goto bad_area;
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index e9c6a8014bd6..e3d4d4890104 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -200,6 +200,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 2bd7487440c4..b2f04aee46ec 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -145,6 +145,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto map_err;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto bus_err;
 		BUG();
diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index 332680e5ebf2..2de5dc695a87 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -141,6 +141,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index fa4cf52aa7a6..d46a5ebb7570 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -224,6 +224,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index becc42bb1849..70ab5d664332 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -158,6 +158,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 3516cbdf1ee9..0c2cc5d39c8e 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -262,6 +262,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 15a0bb5fc06d..34429d5a0ccd 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -135,6 +135,8 @@ survive:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 0703acf7d327..0a0c5fa1a64f 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -171,6 +171,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (faulr & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 3ca9c1131cfe..e5120e653240 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -256,6 +256,8 @@ good_area:
 		 */
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto bad_area;
 		BUG();
diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f082c78..094b7d445023 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -79,6 +79,9 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
 		} else if (*flt & VM_FAULT_SIGBUS) {
 			ret = -EFAULT;
 			goto out_unlock;
+		} else if (*flt & VM_FAULT_SIGBUS) {
+			ret = -EFAULT;
+			goto out_unlock;
 		}
 		BUG();
 	}
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index eb79907f34fa..6154b0a2b063 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -437,6 +437,8 @@ good_area:
 	 */
 	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
+		if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		rc = mm_fault_error(regs, address, fault);
 		if (rc >= MM_FAULT_RETURN)
 			goto bail;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 811937bb90be..9065d5aa3932 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -374,6 +374,12 @@ static noinline void do_fault_error(struct pt_regs *regs, int fault)
 				do_no_context(regs);
 			else
 				pagefault_out_of_memory();
+		} else if (fault & VM_FAULT_SIGSEGV) {
+			/* Kernel mode? Handle exceptions or die */
+			if (!user_mode(regs))
+				do_no_context(regs);
+			else
+				do_sigsegv(regs, SEGV_MAPERR);
 		} else if (fault & VM_FAULT_SIGBUS) {
 			/* Kernel mode? Handle exceptions or die */
 			if (!user_mode(regs))
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 52238983527d..6860beb2a280 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -114,6 +114,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 541dc6101508..a58fec9b55e0 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -353,6 +353,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	} else {
 		if (fault & VM_FAULT_SIGBUS)
 			do_sigbus(regs, error_code, address);
+		else if (fault & VM_FAULT_SIGSEGV)
+			bad_area(regs, error_code, address);
 		else
 			BUG();
 	}
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index 908e8c17c902..70d817154fe8 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -249,6 +249,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 18fcd7167095..479823249429 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -446,6 +446,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 565e25a98334..0f61a73534e6 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -442,6 +442,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 5678c3571e7c..209617302df8 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -80,6 +80,8 @@ good_area:
 		if (unlikely(fault & VM_FAULT_ERROR)) {
 			if (fault & VM_FAULT_OOM) {
 				goto out_of_memory;
+			} else if (fault & VM_FAULT_SIGSEGV) {
+				goto out;
 			} else if (fault & VM_FAULT_SIGBUS) {
 				err = -EACCES;
 				goto out;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 38dcec403b46..e3ff27a5b634 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -898,6 +898,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
 			     VM_FAULT_HWPOISON_LARGE))
 			do_sigbus(regs, error_code, address, fault);
+		else if (fault & VM_FAULT_SIGSEGV)
+			bad_area_nosemaphore(regs, error_code, address);
 		else
 			BUG();
 	}
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index b57c4f91f487..9e3571a6535c 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -117,6 +117,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/drivers/staging/lustre/lustre/llite/vvp_io.c b/drivers/staging/lustre/lustre/llite/vvp_io.c
index 930f6010203e..65d610abe06e 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_io.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_io.c
@@ -632,7 +632,7 @@ static int vvp_io_kernel_fault(struct vvp_fault_io *cfio)
 		return 0;
 	}
 
-	if (cfio->fault.ft_flags & VM_FAULT_SIGBUS) {
+	if (cfio->fault.ft_flags & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV)) {
 		CDEBUG(D_PAGE, "got addr %p - SIGBUS\n", vmf->virtual_address);
 		return -EFAULT;
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80fc92a49649..dd5ea3016fc4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1070,6 +1070,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_WRITE	0x0008	/* Special case for get_user_pages */
 #define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned small page */
 #define VM_FAULT_HWPOISON_LARGE 0x0020  /* Hit poisoned large page. Index encoded in upper bits */
+#define VM_FAULT_SIGSEGV 0x0040
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
@@ -1078,8 +1079,9 @@ static inline int page_mapped(struct page *page)
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
-#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
-			 VM_FAULT_FALLBACK | VM_FAULT_HWPOISON_LARGE)
+#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | \
+			 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
+			 VM_FAULT_FALLBACK)
 
 /* Encode hstate index for a hwpoisoned large page */
 #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
diff --git a/mm/gup.c b/mm/gup.c
index a900759cc807..8dd50ce6326f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -296,7 +296,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 			return -ENOMEM;
 		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
 			return *flags & FOLL_HWPOISON ? -EHWPOISON : -EFAULT;
-		if (ret & VM_FAULT_SIGBUS)
+		if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
 			return -EFAULT;
 		BUG();
 	}
@@ -571,7 +571,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			return -ENOMEM;
 		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
 			return -EHWPOISON;
-		if (ret & VM_FAULT_SIGBUS)
+		if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
 			return -EFAULT;
 		BUG();
 	}
diff --git a/mm/ksm.c b/mm/ksm.c
index d247efab5073..15647fb0394f 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -376,7 +376,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 		else
 			ret = VM_FAULT_WRITE;
 		put_page(page);
-	} while (!(ret & (VM_FAULT_WRITE | VM_FAULT_SIGBUS | VM_FAULT_OOM)));
+	} while (!(ret & (VM_FAULT_WRITE | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | VM_FAULT_OOM)));
 	/*
 	 * We must loop because handle_mm_fault() may back out if there's
 	 * any difficulty e.g. if pte accessed bit gets updated concurrently.

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: libsigsegv build fail with kernel 3.18.3
@ 2015-01-27 21:12               ` Jan Engelhardt
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Engelhardt @ 2015-01-27 21:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Takashi Iwai, opensuse-factory-stAJ6ESoqRxg9hUCZPvPmw,
	OpenSUSE Kernel Team, linux-arch-u79uwXL29TY76Z2rM5mHXA


On Tuesday 2015-01-27 21:36, Linus Torvalds wrote:
>The only reason it returns SIGBUS is that we don't have an equivalent
>VM_FAULT_SIGSEGV error return.  Which is something of an oversight,
>but adding it is a pain too, simply because the callers are mainly the
>architecture page fault handlers, so the VM_FAULT_SIGSEGV handling has
>to be added to all of them.
>
>Very annoying. The patch would look something like the attached -
>TOTALLY UNTESTED.
>
>Does this fix things for you?

The 2nd patch (from 12:57:20 -0800) plus the one-liner for mm/memory.c
at the callsite for check_stack_guard_page makes the test still 
not succeed.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
@ 2015-01-27 21:12               ` Jan Engelhardt
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Engelhardt @ 2015-01-27 21:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Takashi Iwai, opensuse-factory, OpenSUSE Kernel Team, linux-arch


On Tuesday 2015-01-27 21:36, Linus Torvalds wrote:
>The only reason it returns SIGBUS is that we don't have an equivalent
>VM_FAULT_SIGSEGV error return.  Which is something of an oversight,
>but adding it is a pain too, simply because the callers are mainly the
>architecture page fault handlers, so the VM_FAULT_SIGSEGV handling has
>to be added to all of them.
>
>Very annoying. The patch would look something like the attached -
>TOTALLY UNTESTED.
>
>Does this fix things for you?

The 2nd patch (from 12:57:20 -0800) plus the one-liner for mm/memory.c
at the callsite for check_stack_guard_page makes the test still 
not succeed.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
  2015-01-27 21:12               ` [opensuse-factory] Re: [opensuse-kernel] " Jan Engelhardt
  (?)
@ 2015-01-27 21:32               ` Linus Torvalds
  2015-01-27 22:14                 ` Jan Engelhardt
  -1 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2015-01-27 21:32 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Takashi Iwai, opensuse-factory, OpenSUSE Kernel Team, linux-arch

On Tue, Jan 27, 2015 at 1:12 PM, Jan Engelhardt <jengelh@inai.de> wrote:
>
> The 2nd patch (from 12:57:20 -0800) plus the one-liner for mm/memory.c
> at the callsite for check_stack_guard_page makes the test still
> not succeed.

Really? What platform? I get

  ==================
  All 5 tests passed
  ==================

with the patch I sent out, combined with this:

   diff --git a/mm/memory.c b/mm/memory.c
   index 54f3a9b00956..2c3536cc6c63 100644
   --- a/mm/memory.c
   +++ b/mm/memory.c
   @@ -2632,7 +2632,7 @@ static int do_anonymous_page(struct mm_struct
*mm, struct vm_area_struct *vma,

           /* Check if we need to add a guard page to the stack */
           if (check_stack_guard_page(vma, address) < 0)
   -               return VM_FAULT_SIGBUS;
   +               return VM_FAULT_SIGSEGV;

           /* Use the zero-page for reads */
           if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {

but since architecture matters, and I run on x86-64, some other
platform might still have issues.

(And yes, I checked that it failed without the patch, so I think I'm
testing the same thing you guys are)

                         Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
  2015-01-27 21:32               ` Linus Torvalds
@ 2015-01-27 22:14                 ` Jan Engelhardt
  2015-01-27 22:32                   ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Engelhardt @ 2015-01-27 22:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Takashi Iwai, opensuse-factory, OpenSUSE Kernel Team, linux-arch


On Tuesday 2015-01-27 22:32, Linus Torvalds wrote:
>On Tue, Jan 27, 2015 at 1:12 PM, Jan Engelhardt <jengelh@inai.de> wrote:
>>
>> The 2nd patch (from 12:57:20 -0800) plus the one-liner for mm/memory.c
>> at the callsite for check_stack_guard_page makes the test still
>> not succeed.
>
>Really? What platform? I get "all 5 tests passed"
>with the patch I sent out, combined with this:
>   diff --git a/mm/memory.c b/mm/memory.c
>   index 54f3a9b00956..2c3536cc6c63 100644

Turns out I missed a make call in the procedure  *blush*.
Sometimes I wish `make modules_install` would run the compile stage as well..
just like `make install` does in autotooled projects.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
  2015-01-27 22:14                 ` Jan Engelhardt
@ 2015-01-27 22:32                   ` Linus Torvalds
  2015-01-27 23:13                     ` Jan Engelhardt
                                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-01-27 22:32 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Takashi Iwai, opensuse-factory, OpenSUSE Kernel Team, linux-arch

On Tue, Jan 27, 2015 at 2:14 PM, Jan Engelhardt <jengelh@inai.de> wrote:
>
> Sometimes I wish `make modules_install` would run the compile stage as well..
> just like `make install` does in autotooled projects.

This is just one more reason to hate autotools. It's shit.

You should *never* compile things as root. It's not just a bad idea in
that it leaves possible root-owned files in your project (which in
turn can make cleanup harder etc), but it is historically also an
actual security issue with the whole temporary file mess in /tmp and
various racy symlink-attacks.

So in general, you should try to teach people that "make install"
should never ever build anything. Because if it does, it's just a
broken code-flow. It may work, it may be convenient, and with proper
tools it _may_ even be secure, but it's still a really bad idea.

Sadly, I seem to be alone in my hatred of automake/configure/etc.

                          Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
  2015-01-27 22:32                   ` Linus Torvalds
@ 2015-01-27 23:13                     ` Jan Engelhardt
  2015-01-27 23:53                     ` David Miller
       [not found]                     ` <CA+55aFzguEFfG2REN1soMC+0UJ7GtANfEvMoCNPt0QqmP9LKoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 0 replies; 43+ messages in thread
From: Jan Engelhardt @ 2015-01-27 23:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Takashi Iwai, opensuse-factory, OpenSUSE Kernel Team, linux-arch


On Tuesday 2015-01-27 23:32, Linus Torvalds wrote:
>On Tue, Jan 27, 2015 at 2:14 PM, Jan Engelhardt <jengelh@inai.de> wrote:
>>
>> Sometimes I wish `make modules_install` would run the compile stage as well..
>> just like `make install` does in autotooled projects.
>
>You should *never* compile things as root.

I certainly agree with you on that. So perhaps make throwing an error
instead would be helpful here - "hey there are uncompiled things
left, modules_install would give you nonmatching files".


>Sadly, I seem to be alone in my hatred of automake/configure/etc.

(Look no further than netdev if you need company.)

There are certainly points where it does not excel; however, in a 
distribution where we have to deal with not just one idiosyncratic build 
system, but 15+, it is the preferred choice to have especially if and 
when it comes down having to edit the build instructions for one reason 
or another.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
  2015-01-27 22:32                   ` Linus Torvalds
  2015-01-27 23:13                     ` Jan Engelhardt
@ 2015-01-27 23:53                     ` David Miller
       [not found]                     ` <CA+55aFzguEFfG2REN1soMC+0UJ7GtANfEvMoCNPt0QqmP9LKoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 0 replies; 43+ messages in thread
From: David Miller @ 2015-01-27 23:53 UTC (permalink / raw)
  To: torvalds; +Cc: jengelh, tiwai, opensuse-factory, opensuse-kernel, linux-arch

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 27 Jan 2015 14:32:11 -0800

> Sadly, I seem to be alone in my hatred of automake/configure/etc.

You're not, there are even I Hate Autoconf t-shirts.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: libsigsegv build fail with kernel 3.18.3
@ 2015-01-28  7:38                 ` Heiko Carstens
  0 siblings, 0 replies; 43+ messages in thread
From: Heiko Carstens @ 2015-01-28  7:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Takashi Iwai, linux-arch-u79uwXL29TY76Z2rM5mHXA,
	opensuse-factory-stAJ6ESoqRxg9hUCZPvPmw, OpenSUSE Kernel Team

On Tue, Jan 27, 2015 at 12:57:20PM -0800, Linus Torvalds wrote:
> [ Adding 'linux-arch' to the recipients, since this touches pretty
> much all architectures ]
[...]
> Can people take a look?
> 
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index 811937bb90be..9065d5aa3932 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -374,6 +374,12 @@ static noinline void do_fault_error(struct pt_regs *regs, int fault)
>  				do_no_context(regs);
>  			else
>  				pagefault_out_of_memory();
> +		} else if (fault & VM_FAULT_SIGSEGV) {
> +			/* Kernel mode? Handle exceptions or die */
> +			if (!user_mode(regs))
> +				do_no_context(regs);
> +			else
> +				do_sigsegv(regs, SEGV_MAPERR);

s390 still compiles and boots with this patch applied.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
@ 2015-01-28  7:38                 ` Heiko Carstens
  0 siblings, 0 replies; 43+ messages in thread
From: Heiko Carstens @ 2015-01-28  7:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Takashi Iwai, linux-arch, opensuse-factory, OpenSUSE Kernel Team

On Tue, Jan 27, 2015 at 12:57:20PM -0800, Linus Torvalds wrote:
> [ Adding 'linux-arch' to the recipients, since this touches pretty
> much all architectures ]
[...]
> Can people take a look?
> 
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index 811937bb90be..9065d5aa3932 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -374,6 +374,12 @@ static noinline void do_fault_error(struct pt_regs *regs, int fault)
>  				do_no_context(regs);
>  			else
>  				pagefault_out_of_memory();
> +		} else if (fault & VM_FAULT_SIGSEGV) {
> +			/* Kernel mode? Handle exceptions or die */
> +			if (!user_mode(regs))
> +				do_no_context(regs);
> +			else
> +				do_sigsegv(regs, SEGV_MAPERR);

s390 still compiles and boots with this patch applied.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: libsigsegv build fail with kernel 3.18.3
@ 2015-01-28  8:48                         ` Andreas Schwab
  0 siblings, 0 replies; 43+ messages in thread
From: Andreas Schwab @ 2015-01-28  8:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Engelhardt, Takashi Iwai,
	opensuse-factory-stAJ6ESoqRxg9hUCZPvPmw, OpenSUSE Kernel Team,
	linux-arch@vger.kernel.org

Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> writes:

> So in general, you should try to teach people that "make install"
> should never ever build anything. Because if it does, it's just a
> broken code-flow. It may work, it may be convenient, and with proper
> tools it _may_ even be secure, but it's still a really bad idea.

# make modules_install
Makefile:657: Cannot use CONFIG_CC_STACKPROTECTOR_REGULAR: -fstack-protector not supported by compiler
[...]

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab-l3A5Bk7waGM@public.gmane.org
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
@ 2015-01-28  8:48                         ` Andreas Schwab
  0 siblings, 0 replies; 43+ messages in thread
From: Andreas Schwab @ 2015-01-28  8:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Engelhardt, Takashi Iwai, opensuse-factory,
	OpenSUSE Kernel Team, linux-arch

Linus Torvalds <torvalds@linux-foundation.org> writes:

> So in general, you should try to teach people that "make install"
> should never ever build anything. Because if it does, it's just a
> broken code-flow. It may work, it may be convenient, and with proper
> tools it _may_ even be secure, but it's still a really bad idea.

# make modules_install
Makefile:657: Cannot use CONFIG_CC_STACKPROTECTOR_REGULAR: -fstack-protector not supported by compiler
[...]

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
  2015-01-27 20:57           ` [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3 Linus Torvalds
       [not found]             ` <CA+55aFxRnj97rpSQvvzLJhpo7C8TQ-F=eB1Ry2n53AV1rN8mwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-29  6:59             ` Max Filippov
  2015-01-29 18:16               ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Max Filippov @ 2015-01-29  6:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Takashi Iwai, linux-arch, opensuse-factory, OpenSUSE Kernel Team

On Tue, Jan 27, 2015 at 11:57 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So this would be the more complete patch. Still totally untested. I
> may have screwed up something obvious.

Condition in arch/powerpc/mm/copro_fault.c hunk doesn't look right.

-- 
Thanks.
-- Max

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
  2015-01-29  6:59             ` Max Filippov
@ 2015-01-29 18:16               ` Linus Torvalds
  2015-02-02  0:23                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2015-01-29 18:16 UTC (permalink / raw)
  To: Max Filippov
  Cc: Takashi Iwai, linux-arch, opensuse-factory, OpenSUSE Kernel Team

[-- Attachment #1: Type: text/plain, Size: 416 bytes --]

On Wed, Jan 28, 2015 at 10:59 PM, Max Filippov <jcmvbkbc@gmail.com> wrote:
>
> Condition in arch/powerpc/mm/copro_fault.c hunk doesn't look right.

Yes. There was a typo too ("faulr" instead of "fault" in one arch). I
didn't bother sending out my minor updated version because my fixes
were tiny, but since somebody actually went through the old version,
here is my updated one.

                              Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 16178 bytes --]

 arch/alpha/mm/fault.c                        | 2 ++
 arch/arc/mm/fault.c                          | 2 ++
 arch/avr32/mm/fault.c                        | 2 ++
 arch/cris/mm/fault.c                         | 2 ++
 arch/frv/mm/fault.c                          | 2 ++
 arch/ia64/mm/fault.c                         | 2 ++
 arch/m32r/mm/fault.c                         | 2 ++
 arch/m68k/mm/fault.c                         | 2 ++
 arch/metag/mm/fault.c                        | 2 ++
 arch/microblaze/mm/fault.c                   | 2 ++
 arch/mips/mm/fault.c                         | 2 ++
 arch/mn10300/mm/fault.c                      | 2 ++
 arch/nios2/mm/fault.c                        | 2 ++
 arch/openrisc/mm/fault.c                     | 2 ++
 arch/parisc/mm/fault.c                       | 2 ++
 arch/powerpc/mm/copro_fault.c                | 2 +-
 arch/powerpc/mm/fault.c                      | 2 ++
 arch/s390/mm/fault.c                         | 6 ++++++
 arch/score/mm/fault.c                        | 2 ++
 arch/sh/mm/fault.c                           | 2 ++
 arch/sparc/mm/fault_32.c                     | 2 ++
 arch/sparc/mm/fault_64.c                     | 2 ++
 arch/tile/mm/fault.c                         | 2 ++
 arch/um/kernel/trap.c                        | 2 ++
 arch/x86/mm/fault.c                          | 2 ++
 arch/xtensa/mm/fault.c                       | 2 ++
 drivers/staging/lustre/lustre/llite/vvp_io.c | 2 +-
 include/linux/mm.h                           | 6 ++++--
 mm/gup.c                                     | 4 ++--
 mm/ksm.c                                     | 2 +-
 mm/memory.c                                  | 2 +-
 31 files changed, 64 insertions(+), 8 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 98838a05ba6d..9d0ac091a52a 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -156,6 +156,8 @@ retry:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 6f7e3a68803a..0f8df3b5b1b3 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -161,6 +161,8 @@ good_area:
 
 	if (fault & VM_FAULT_OOM)
 		goto out_of_memory;
+	else if (fault & VM_FAULT_SIGSEV)
+		goto bad_area;
 	else if (fault & VM_FAULT_SIGBUS)
 		goto do_sigbus;
 
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index 0eca93327195..d223a8b57c1e 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -142,6 +142,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 1790f22e71a2..2686a7aa8ec8 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -176,6 +176,8 @@ retry:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index 9a66372fc7c7..ec4917ddf678 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -168,6 +168,8 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 7225dad87094..ba5ba7accd0d 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -172,6 +172,8 @@ retry:
 		 */
 		if (fault & VM_FAULT_OOM) {
 			goto out_of_memory;
+		} else if (fault & VM_FAULT_SIGSEGV) {
+			goto bad_area;
 		} else if (fault & VM_FAULT_SIGBUS) {
 			signal = SIGBUS;
 			goto bad_area;
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index e9c6a8014bd6..e3d4d4890104 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -200,6 +200,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 2bd7487440c4..b2f04aee46ec 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -145,6 +145,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto map_err;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto bus_err;
 		BUG();
diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index 332680e5ebf2..2de5dc695a87 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -141,6 +141,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index fa4cf52aa7a6..d46a5ebb7570 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -224,6 +224,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index becc42bb1849..70ab5d664332 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -158,6 +158,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 3516cbdf1ee9..0c2cc5d39c8e 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -262,6 +262,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 15a0bb5fc06d..34429d5a0ccd 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -135,6 +135,8 @@ survive:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 0703acf7d327..230ac20ae794 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -171,6 +171,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 3ca9c1131cfe..e5120e653240 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -256,6 +256,8 @@ good_area:
 		 */
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto bad_area;
 		BUG();
diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f082c78..1b5305d4bdab 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -76,7 +76,7 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
 		if (*flt & VM_FAULT_OOM) {
 			ret = -ENOMEM;
 			goto out_unlock;
-		} else if (*flt & VM_FAULT_SIGBUS) {
+		} else if (*flt & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV)) {
 			ret = -EFAULT;
 			goto out_unlock;
 		}
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index eb79907f34fa..6154b0a2b063 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -437,6 +437,8 @@ good_area:
 	 */
 	fault = handle_mm_fault(mm, vma, address, flags);
 	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
+		if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		rc = mm_fault_error(regs, address, fault);
 		if (rc >= MM_FAULT_RETURN)
 			goto bail;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 811937bb90be..9065d5aa3932 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -374,6 +374,12 @@ static noinline void do_fault_error(struct pt_regs *regs, int fault)
 				do_no_context(regs);
 			else
 				pagefault_out_of_memory();
+		} else if (fault & VM_FAULT_SIGSEGV) {
+			/* Kernel mode? Handle exceptions or die */
+			if (!user_mode(regs))
+				do_no_context(regs);
+			else
+				do_sigsegv(regs, SEGV_MAPERR);
 		} else if (fault & VM_FAULT_SIGBUS) {
 			/* Kernel mode? Handle exceptions or die */
 			if (!user_mode(regs))
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 52238983527d..6860beb2a280 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -114,6 +114,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 541dc6101508..a58fec9b55e0 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -353,6 +353,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	} else {
 		if (fault & VM_FAULT_SIGBUS)
 			do_sigbus(regs, error_code, address);
+		else if (fault & VM_FAULT_SIGSEGV)
+			bad_area(regs, error_code, address);
 		else
 			BUG();
 	}
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index 908e8c17c902..70d817154fe8 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -249,6 +249,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 18fcd7167095..479823249429 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -446,6 +446,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 565e25a98334..0f61a73534e6 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -442,6 +442,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 5678c3571e7c..209617302df8 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -80,6 +80,8 @@ good_area:
 		if (unlikely(fault & VM_FAULT_ERROR)) {
 			if (fault & VM_FAULT_OOM) {
 				goto out_of_memory;
+			} else if (fault & VM_FAULT_SIGSEGV) {
+				goto out;
 			} else if (fault & VM_FAULT_SIGBUS) {
 				err = -EACCES;
 				goto out;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 38dcec403b46..e3ff27a5b634 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -898,6 +898,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
 			     VM_FAULT_HWPOISON_LARGE))
 			do_sigbus(regs, error_code, address, fault);
+		else if (fault & VM_FAULT_SIGSEGV)
+			bad_area_nosemaphore(regs, error_code, address);
 		else
 			BUG();
 	}
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index b57c4f91f487..9e3571a6535c 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -117,6 +117,8 @@ good_area:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
+		else if (fault & VM_FAULT_SIGSEGV)
+			goto bad_area;
 		else if (fault & VM_FAULT_SIGBUS)
 			goto do_sigbus;
 		BUG();
diff --git a/drivers/staging/lustre/lustre/llite/vvp_io.c b/drivers/staging/lustre/lustre/llite/vvp_io.c
index 930f6010203e..65d610abe06e 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_io.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_io.c
@@ -632,7 +632,7 @@ static int vvp_io_kernel_fault(struct vvp_fault_io *cfio)
 		return 0;
 	}
 
-	if (cfio->fault.ft_flags & VM_FAULT_SIGBUS) {
+	if (cfio->fault.ft_flags & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV)) {
 		CDEBUG(D_PAGE, "got addr %p - SIGBUS\n", vmf->virtual_address);
 		return -EFAULT;
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80fc92a49649..dd5ea3016fc4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1070,6 +1070,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_WRITE	0x0008	/* Special case for get_user_pages */
 #define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned small page */
 #define VM_FAULT_HWPOISON_LARGE 0x0020  /* Hit poisoned large page. Index encoded in upper bits */
+#define VM_FAULT_SIGSEGV 0x0040
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
@@ -1078,8 +1079,9 @@ static inline int page_mapped(struct page *page)
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
-#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
-			 VM_FAULT_FALLBACK | VM_FAULT_HWPOISON_LARGE)
+#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | \
+			 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
+			 VM_FAULT_FALLBACK)
 
 /* Encode hstate index for a hwpoisoned large page */
 #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
diff --git a/mm/gup.c b/mm/gup.c
index a900759cc807..8dd50ce6326f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -296,7 +296,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 			return -ENOMEM;
 		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
 			return *flags & FOLL_HWPOISON ? -EHWPOISON : -EFAULT;
-		if (ret & VM_FAULT_SIGBUS)
+		if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
 			return -EFAULT;
 		BUG();
 	}
@@ -571,7 +571,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			return -ENOMEM;
 		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
 			return -EHWPOISON;
-		if (ret & VM_FAULT_SIGBUS)
+		if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
 			return -EFAULT;
 		BUG();
 	}
diff --git a/mm/ksm.c b/mm/ksm.c
index d247efab5073..15647fb0394f 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -376,7 +376,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 		else
 			ret = VM_FAULT_WRITE;
 		put_page(page);
-	} while (!(ret & (VM_FAULT_WRITE | VM_FAULT_SIGBUS | VM_FAULT_OOM)));
+	} while (!(ret & (VM_FAULT_WRITE | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | VM_FAULT_OOM)));
 	/*
 	 * We must loop because handle_mm_fault() may back out if there's
 	 * any difficulty e.g. if pte accessed bit gets updated concurrently.
diff --git a/mm/memory.c b/mm/memory.c
index 54f3a9b00956..2c3536cc6c63 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2632,7 +2632,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/* Check if we need to add a guard page to the stack */
 	if (check_stack_guard_page(vma, address) < 0)
-		return VM_FAULT_SIGBUS;
+		return VM_FAULT_SIGSEGV;
 
 	/* Use the zero-page for reads */
 	if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
  2015-01-29 18:16               ` Linus Torvalds
@ 2015-02-02  0:23                 ` Benjamin Herrenschmidt
  2015-02-02  1:09                   ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-02  0:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Max Filippov, Takashi Iwai, linux-arch, opensuse-factory,
	OpenSUSE Kernel Team

On Thu, 2015-01-29 at 10:16 -0800, Linus Torvalds wrote:
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index eb79907f34fa..6154b0a2b063 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -437,6 +437,8 @@ good_area:
>          */
>         fault = handle_mm_fault(mm, vma, address, flags);
>         if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
> +               if (fault & VM_FAULT_SIGSEGV)
> +                       goto bad_area;
>                 rc = mm_fault_error(regs, address, fault);
>                 if (rc >= MM_FAULT_RETURN)
>                         goto bail;

I prefer having the test inside mm_fault_error(), even if that makes the
patch a bit bigger, it keeps the logic in a single place. Untested
patch:

--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -184,6 +184,12 @@ static int mm_fault_error(struct pt_regs *regs, unsigned long addr, int fault)
 		return MM_FAULT_RETURN;
 	}
 
+	/* Other faults */
+
+	if (fault & VM_FAULT_SIGSEGV) {
+		up_read(&current->mm->mmap_sem);
+		return MM_FAULT_ERR(SIGSEGV);
+	}
 	if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE))
 		return do_sigbus(regs, addr, fault);
 


Cheers,
Ben.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
  2015-02-02  0:23                 ` Benjamin Herrenschmidt
@ 2015-02-02  1:09                   ` Linus Torvalds
  2015-02-22 23:50                       ` [opensuse-factory] " Benjamin Herrenschmidt
  2015-02-28  7:12                     ` Generic page fault (Was: libsigsegv ....) Benjamin Herrenschmidt
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-02-02  1:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Max Filippov, Takashi Iwai, linux-arch, opensuse-factory,
	OpenSUSE Kernel Team

On Sun, Feb 1, 2015 at 4:23 PM, Benjamin Herrenschmidt <benh@au1.ibm.com> wrote:
>
> I prefer having the test inside mm_fault_error(), even if that makes the
> patch a bit bigger, it keeps the logic in a single place. Untested
> patch:

I'm certainly ok with that, but I wanted to make the code that I
wasn't going to compile (much less test) for various architectures be
as simple and straightforward as possible.

So feel free to send a patch that fixes it up to do it in a single
place after testing it.

Of course, what I *really* want would be to make a new
"generic_mm_fault()" helper that would do all the normal stuff:

 - find_vma()
 - check permissions and ranges
 - call 'handle_mm_fault()'
 - do the proper error, retry and minor/major fault handling

and then most architectures could just call that.

Everybody has their own thing that they do *before* they get the mmap
semaphore (x86 has at least kprobe, mmio tracing, vmalloc-fault,
spurious prefetch faults due to AMD CPU bugs, in-atomic debugging irq
re-enablement, etc etc). Those aren't even worth it trying to make
into generic things. But just looking at the original patch, a _lot_
of the stuff around the actual handle_mm_fault() call is very very
generic indeed, and I think it could be made into a generic helper
function.

For example, I not that long ago fixed the x86 handling of
VM_FAULT_RETRY when there was a fatal signal pending and we were
returning to kernel mode. The code would just return, "knowing" that
the fatal signal would mean that rather than faulting again, the
process would be killed. Except if the fault happened from kernel
space, that wouldn't happen, and it really would just fault forever.

I would not be surprised if somebody copied the old incorrect x86 case
for VM_FAULT_RETRY and fatal signals, and it just never got fixed
anywhere else.

Doing all that in a generic helper routine (that then just returns a
value saying "SIGBUS", "SIGSEGV" - we'd probably have to split it up
into "SIGSEGV due to access error" vs "SIGSEGV due to map error" to
make everybody happy, "out of memory" or "success" would have made it
trivial to add the whole VM_FAULT_SIGSEGV, but it would also mean that
everybody got fixes to the retry handling etc. Because right now
there's a *lot* of basically duplicated code (although powerpc and
s390 in particular have made some changes of their own).

Anybody willing to see if they could encapsulate that part of the x86
code, and make it more widely useful? I say "x86 code", because that's
the most tested one, and I think it gets the odd retry and error cases
right (and minor/major fault counting etc), unlike some.

                           Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
@ 2015-02-22 23:50                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-22 23:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Max Filippov, Takashi Iwai, linux-arch, opensuse-factory,
	OpenSUSE Kernel Team

On Sun, 2015-02-01 at 17:09 -0800, Linus Torvalds wrote:
> On Sun, Feb 1, 2015 at 4:23 PM, Benjamin Herrenschmidt <benh@au1.ibm.com> wrote:
> >
> > I prefer having the test inside mm_fault_error(), even if that makes the
> > patch a bit bigger, it keeps the logic in a single place. Untested
> > patch:
> 
> I'm certainly ok with that, but I wanted to make the code that I
> wasn't going to compile (much less test) for various architectures be
> as simple and straightforward as possible.

Ah, I missed your reply ... my fault for using the wrong email address
to send my message in the first place :-)

> So feel free to send a patch that fixes it up to do it in a single
> place after testing it.

Ok sure, I'll have a look in the next few days, bogged down with some
local emergency right now.

> Of course, what I *really* want would be to make a new
> "generic_mm_fault()" helper that would do all the normal stuff:
> 
>  - find_vma()
>  - check permissions and ranges
>  - call 'handle_mm_fault()'
>  - do the proper error, retry and minor/major fault handling
> 
> and then most architectures could just call that.

That would be great ...

> Anybody willing to see if they could encapsulate that part of the x86
> code, and make it more widely useful? I say "x86 code", because that's
> the most tested one, and I think it gets the odd retry and error cases
> right (and minor/major fault counting etc), unlike some.

I can try to give it a spin some time this week I think, I can probably
do x86, powerpc and arm. Let's see if I manage to not forget :)

Cheers,
Ben.


>                            Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-arch" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
To unsubscribe, e-mail: opensuse-factory+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-factory+owner@opensuse.org

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3
@ 2015-02-22 23:50                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-22 23:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Max Filippov, Takashi Iwai, linux-arch, opensuse-factory,
	OpenSUSE Kernel Team

On Sun, 2015-02-01 at 17:09 -0800, Linus Torvalds wrote:
> On Sun, Feb 1, 2015 at 4:23 PM, Benjamin Herrenschmidt <benh@au1.ibm.com> wrote:
> >
> > I prefer having the test inside mm_fault_error(), even if that makes the
> > patch a bit bigger, it keeps the logic in a single place. Untested
> > patch:
> 
> I'm certainly ok with that, but I wanted to make the code that I
> wasn't going to compile (much less test) for various architectures be
> as simple and straightforward as possible.

Ah, I missed your reply ... my fault for using the wrong email address
to send my message in the first place :-)

> So feel free to send a patch that fixes it up to do it in a single
> place after testing it.

Ok sure, I'll have a look in the next few days, bogged down with some
local emergency right now.

> Of course, what I *really* want would be to make a new
> "generic_mm_fault()" helper that would do all the normal stuff:
> 
>  - find_vma()
>  - check permissions and ranges
>  - call 'handle_mm_fault()'
>  - do the proper error, retry and minor/major fault handling
> 
> and then most architectures could just call that.

That would be great ...

> Anybody willing to see if they could encapsulate that part of the x86
> code, and make it more widely useful? I say "x86 code", because that's
> the most tested one, and I think it gets the odd retry and error cases
> right (and minor/major fault counting etc), unlike some.

I can try to give it a spin some time this week I think, I can probably
do x86, powerpc and arm. Let's see if I manage to not forget :)

Cheers,
Ben.


>                            Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-arch" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Generic page fault (Was: libsigsegv ....)
  2015-02-02  1:09                   ` Linus Torvalds
  2015-02-22 23:50                       ` [opensuse-factory] " Benjamin Herrenschmidt
@ 2015-02-28  7:12                     ` Benjamin Herrenschmidt
  2015-02-28  7:14                         ` Benjamin Herrenschmidt
  2015-02-28 19:56                         ` Linus Torvalds
  1 sibling, 2 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28  7:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1134 bytes --]

On Sun, 2015-02-01 at 17:09 -0800, Linus Torvalds wrote:
> 
> Of course, what I *really* want would be to make a new
> "generic_mm_fault()" helper that would do all the normal stuff:
> 
>  - find_vma()
>  - check permissions and ranges
>  - call 'handle_mm_fault()'
>  - do the proper error, retry and minor/major fault handling
> 
> and then most architectures could just call that.

So I spent a bit of time today while the kids were playing quietly (it
does happen !), and came up with this (very) draft pair of patches
for x86 and powerpc. It's by no mean a finished product as you can see,
but it shows how "messy" things get. Basically a bit more messy than I
originally thought but it's not *too* bad either.

Let me know what you think of the approach. It's boot tested on x86_64
in qemu and .

Next I think I'll tackle ARM, test a bit more, clean a few things up and
submit, but by all means, please provide feedback on the approach before
that :)

I'm attaching the patches for now (there are two and I don't feel like
posting two emails on that subject. The final stuff will be broken down
in smaller bits).

Cheers,
Ben.


[-- Attachment #2: 0001-Move-bulk-of-x86-__do_page_fault-to-a-generic_page_f.patch --]
[-- Type: text/x-patch, Size: 20798 bytes --]

From 1e3060ecdb479a3dfd587a5870e0351e0b1b5ddc Mon Sep 17 00:00:00 2001
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Sat, 28 Feb 2015 17:38:17 +1100
Subject: [PATCH 1/2] Move bulk of x86 __do_page_fault() to a
 generic_page_fault()

(Add add various hooks that other archs will need etc...)

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/x86/include/asm/fault.h |  99 +++++++++++++++++
 arch/x86/mm/fault.c          | 253 +++----------------------------------------
 include/linux/fault.h        |  11 ++
 mm/Makefile                  |   2 +-
 mm/fault.c                   | 171 +++++++++++++++++++++++++++++
 5 files changed, 296 insertions(+), 240 deletions(-)
 create mode 100644 arch/x86/include/asm/fault.h
 create mode 100644 include/linux/fault.h
 create mode 100644 mm/fault.c

diff --git a/arch/x86/include/asm/fault.h b/arch/x86/include/asm/fault.h
new file mode 100644
index 0000000..7c1712e1
--- /dev/null
+++ b/arch/x86/include/asm/fault.h
@@ -0,0 +1,99 @@
+#ifndef _ASM_X86_FAULT_H
+#define _ASM_X86_FAULT_H
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+/*
+ * Page fault error code bits:
+ *
+ *   bit 0 ==	 0: no page found	1: protection fault
+ *   bit 1 ==	 0: read access		1: write access
+ *   bit 2 ==	 0: kernel-mode access	1: user-mode access
+ *   bit 3 ==				1: use of reserved bit detected
+ *   bit 4 ==				1: fault was an instruction fetch
+ */
+enum x86_pf_error_code {
+
+	PF_PROT		=		1 << 0,
+	PF_WRITE	=		1 << 1,
+	PF_USER		=		1 << 2,
+	PF_RSVD		=		1 << 3,
+	PF_INSTR	=		1 << 4,
+};
+
+static inline bool fault_is_user(struct pt_regs *regs, unsigned long err_code)
+{
+	return err_code & PF_USER;
+}
+
+static inline bool fault_is_write(struct pt_regs *regs, unsigned long err_code)
+{
+	return err_code & PF_WRITE;
+}
+
+/* Return type for do_page_fault */
+typedef void gpf_ret_t;
+
+#define FAULT_NO_ERR
+
+/* Check if the stack is allowed to grow during a user page fault */
+static inline bool stack_can_grow(struct pt_regs *regs, unsigned long err_code,
+				  unsigned long address,
+				  struct vm_area_struct *vma)
+{
+	/*
+	 * Accessing the stack below %sp is always a bug.
+	 * The large cushion allows instructions like enter
+	 * and pusha to work. ("enter $65535, $31" pushes
+	 * 32 pointers and then decrements %sp by 65535.)
+	 */
+	return address + 65536 + 32 * sizeof(unsigned long) >= regs->sp;
+}
+
+/* Access validity check */
+static inline bool access_error(struct pt_regs *regs, unsigned long err_code,
+				struct vm_area_struct *vma)
+{
+	if (err_code & PF_WRITE) {
+		/* write, present and write, not present: */
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			return true;
+		return false;
+	}
+
+	/* read, present: */
+	if (unlikely(err_code & PF_PROT))
+		return true;
+
+	/* read, not present: */
+	if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+		return true;
+
+	return false;
+}
+
+/* Error handlers */
+
+gpf_ret_t handle_bad_area(struct pt_regs *regs, unsigned long error_code,
+			  unsigned long address, int si_code);
+
+
+void no_context(struct pt_regs *regs, unsigned long error_code,
+		unsigned long address, int signal, int si_code);
+
+static inline gpf_ret_t handle_kernel_fault(struct pt_regs *regs,
+					    unsigned long error_code,
+					    unsigned long address, int sig,
+					    int si_code)
+{
+	no_context(regs, error_code, address, sig, si_code);
+}
+
+gpf_ret_t do_sigbus(struct pt_regs *regs, unsigned long error_code,
+		    unsigned long address, unsigned int fault);
+
+static inline void arch_account_major_fault(void) { }
+
+
+#endif /*  _ASM_X86_FAULT_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ede025f..b7ca60a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -13,6 +13,7 @@
 #include <linux/hugetlb.h>		/* hstate_index_to_shift	*/
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
+#include <linux/fault.h>
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -24,24 +25,6 @@
 #include <asm/trace/exceptions.h>
 
 /*
- * Page fault error code bits:
- *
- *   bit 0 ==	 0: no page found	1: protection fault
- *   bit 1 ==	 0: read access		1: write access
- *   bit 2 ==	 0: kernel-mode access	1: user-mode access
- *   bit 3 ==				1: use of reserved bit detected
- *   bit 4 ==				1: fault was an instruction fetch
- */
-enum x86_pf_error_code {
-
-	PF_PROT		=		1 << 0,
-	PF_WRITE	=		1 << 1,
-	PF_USER		=		1 << 2,
-	PF_RSVD		=		1 << 3,
-	PF_INSTR	=		1 << 4,
-};
-
-/*
  * Returns 0 if mmiotrace is disabled, or if the fault is not
  * handled by mmiotrace:
  */
@@ -643,7 +626,7 @@ pgtable_bad(struct pt_regs *regs, unsigned long error_code,
 	oops_end(flags, regs, sig);
 }
 
-static noinline void
+noinline void
 no_context(struct pt_regs *regs, unsigned long error_code,
 	   unsigned long address, int signal, int si_code)
 {
@@ -748,8 +731,7 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 	printk(KERN_CONT "\n");
 }
 
-static void
-__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
+static void __bad_area(struct pt_regs *regs, unsigned long error_code,
 		       unsigned long address, int si_code)
 {
 	struct task_struct *tsk = current;
@@ -804,44 +786,20 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 	no_context(regs, error_code, address, SIGSEGV, si_code);
 }
 
-static noinline void
-bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		     unsigned long address)
-{
-	__bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
-}
-
-static void
-__bad_area(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int si_code)
-{
-	struct mm_struct *mm = current->mm;
-
-	/*
-	 * Something tried to access memory that isn't in our memory map..
-	 * Fix it, but check if it's kernel or user first..
-	 */
-	up_read(&mm->mmap_sem);
-
-	__bad_area_nosemaphore(regs, error_code, address, si_code);
-}
-
-static noinline void
-bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+static inline void bad_area(struct pt_regs *regs, unsigned long error_code,
+			      unsigned long address)
 {
 	__bad_area(regs, error_code, address, SEGV_MAPERR);
 }
 
-static noinline void
-bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
-		      unsigned long address)
+gpf_ret_t handle_bad_area(struct pt_regs *regs, unsigned long error_code,
+			  unsigned long address, int si_code)
 {
-	__bad_area(regs, error_code, address, SEGV_ACCERR);
+	__bad_area(regs, error_code, address, si_code);
 }
 
-static void
-do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
-	  unsigned int fault)
+gpf_ret_t do_sigbus(struct pt_regs *regs, unsigned long error_code,
+		    unsigned long address, unsigned int fault)
 {
 	struct task_struct *tsk = current;
 	int code = BUS_ADRERR;
@@ -871,40 +829,6 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
 }
 
-static noinline void
-mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-	       unsigned long address, unsigned int fault)
-{
-	if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
-		no_context(regs, error_code, address, 0, 0);
-		return;
-	}
-
-	if (fault & VM_FAULT_OOM) {
-		/* Kernel mode? Handle exceptions or die: */
-		if (!(error_code & PF_USER)) {
-			no_context(regs, error_code, address,
-				   SIGSEGV, SEGV_MAPERR);
-			return;
-		}
-
-		/*
-		 * We ran out of memory, call the OOM killer, and return the
-		 * userspace (which will retry the fault, or kill us if we got
-		 * oom-killed):
-		 */
-		pagefault_out_of_memory();
-	} else {
-		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
-			     VM_FAULT_HWPOISON_LARGE))
-			do_sigbus(regs, error_code, address, fault);
-		else if (fault & VM_FAULT_SIGSEGV)
-			bad_area_nosemaphore(regs, error_code, address);
-		else
-			BUG();
-	}
-}
-
 static int spurious_fault_check(unsigned long error_code, pte_t *pte)
 {
 	if ((error_code & PF_WRITE) && !pte_write(*pte))
@@ -998,27 +922,6 @@ NOKPROBE_SYMBOL(spurious_fault);
 
 int show_unhandled_signals = 1;
 
-static inline int
-access_error(unsigned long error_code, struct vm_area_struct *vma)
-{
-	if (error_code & PF_WRITE) {
-		/* write, present and write, not present: */
-		if (unlikely(!(vma->vm_flags & VM_WRITE)))
-			return 1;
-		return 0;
-	}
-
-	/* read, present: */
-	if (unlikely(error_code & PF_PROT))
-		return 1;
-
-	/* read, not present: */
-	if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
-		return 1;
-
-	return 0;
-}
-
 static int fault_in_kernel_space(unsigned long address)
 {
 	return address >= TASK_SIZE_MAX;
@@ -1054,11 +957,8 @@ static noinline void
 __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		unsigned long address)
 {
-	struct vm_area_struct *vma;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
-	int fault, major = 0;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1107,7 +1007,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		 * Don't take the mm semaphore here. If we fixup a prefetch
 		 * fault we could otherwise deadlock:
 		 */
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area(regs, error_code, address);
 
 		return;
 	}
@@ -1120,7 +1020,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		pgtable_bad(regs, error_code, address);
 
 	if (unlikely(smap_violation(error_code, regs))) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area(regs, error_code, address);
 		return;
 	}
 
@@ -1129,7 +1029,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * in an atomic region then we must not take the fault:
 	 */
 	if (unlikely(in_atomic() || !mm)) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area(regs, error_code, address);
 		return;
 	}
 
@@ -1143,137 +1043,12 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (user_mode_vm(regs)) {
 		local_irq_enable();
 		error_code |= PF_USER;
-		flags |= FAULT_FLAG_USER;
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
 	}
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
-	if (error_code & PF_WRITE)
-		flags |= FAULT_FLAG_WRITE;
-
-	/*
-	 * When running in the kernel we expect faults to occur only to
-	 * addresses in user space.  All other faults represent errors in
-	 * the kernel and should generate an OOPS.  Unfortunately, in the
-	 * case of an erroneous fault occurring in a code path which already
-	 * holds mmap_sem we will deadlock attempting to validate the fault
-	 * against the address space.  Luckily the kernel only validly
-	 * references user space from well defined areas of code, which are
-	 * listed in the exceptions table.
-	 *
-	 * As the vast majority of faults will be valid we will only perform
-	 * the source reference check when there is a possibility of a
-	 * deadlock. Attempt to lock the address space, if we cannot we then
-	 * validate the source. If this is invalid we can skip the address
-	 * space check, thus avoiding the deadlock:
-	 */
-	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
-		if ((error_code & PF_USER) == 0 &&
-		    !search_exception_tables(regs->ip)) {
-			bad_area_nosemaphore(regs, error_code, address);
-			return;
-		}
-retry:
-		down_read(&mm->mmap_sem);
-	} else {
-		/*
-		 * The above down_read_trylock() might have succeeded in
-		 * which case we'll have missed the might_sleep() from
-		 * down_read():
-		 */
-		might_sleep();
-	}
-
-	vma = find_vma(mm, address);
-	if (unlikely(!vma)) {
-		bad_area(regs, error_code, address);
-		return;
-	}
-	if (likely(vma->vm_start <= address))
-		goto good_area;
-	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		bad_area(regs, error_code, address);
-		return;
-	}
-	if (error_code & PF_USER) {
-		/*
-		 * Accessing the stack below %sp is always a bug.
-		 * The large cushion allows instructions like enter
-		 * and pusha to work. ("enter $65535, $31" pushes
-		 * 32 pointers and then decrements %sp by 65535.)
-		 */
-		if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
-			bad_area(regs, error_code, address);
-			return;
-		}
-	}
-	if (unlikely(expand_stack(vma, address))) {
-		bad_area(regs, error_code, address);
-		return;
-	}
-
-	/*
-	 * Ok, we have a good vm_area for this memory access, so
-	 * we can handle it..
-	 */
-good_area:
-	if (unlikely(access_error(error_code, vma))) {
-		bad_area_access_error(regs, error_code, address);
-		return;
-	}
-
-	/*
-	 * If for any reason at all we couldn't handle the fault,
-	 * make sure we exit gracefully rather than endlessly redo
-	 * the fault.  Since we never set FAULT_FLAG_RETRY_NOWAIT, if
-	 * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
-	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
-	major |= fault & VM_FAULT_MAJOR;
-
-	/*
-	 * If we need to retry the mmap_sem has already been released,
-	 * and if there is a fatal signal pending there is no guarantee
-	 * that we made any progress. Handle this case first.
-	 */
-	if (unlikely(fault & VM_FAULT_RETRY)) {
-		/* Retry at most once */
-		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-			flags |= FAULT_FLAG_TRIED;
-			if (!fatal_signal_pending(tsk))
-				goto retry;
-		}
-
-		/* User mode? Just return to handle the fatal exception */
-		if (flags & FAULT_FLAG_USER)
-			return;
-
-		/* Not returning to user mode? Handle exceptions or die: */
-		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
-		return;
-	}
-
-	up_read(&mm->mmap_sem);
-	if (unlikely(fault & VM_FAULT_ERROR)) {
-		mm_fault_error(regs, error_code, address, fault);
-		return;
-	}
-
-	/*
-	 * Major/minor page fault accounting. If any of the events
-	 * returned VM_FAULT_MAJOR, we account it as a major fault.
-	 */
-	if (major) {
-		tsk->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
-	} else {
-		tsk->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
-	}
+	generic_page_fault(regs, tsk, error_code, address);
 
 	check_v8086_mode(regs, address, tsk);
 }
diff --git a/include/linux/fault.h b/include/linux/fault.h
new file mode 100644
index 0000000..590d909
--- /dev/null
+++ b/include/linux/fault.h
@@ -0,0 +1,11 @@
+#ifndef __FAULT_H
+#define __FAULT_H
+
+/* Generic page fault stuff */
+
+#include <asm/fault.h>
+
+gpf_ret_t generic_page_fault(struct pt_regs *regs, struct task_struct *tsk,
+			     unsigned long error_code, unsigned long address);
+
+#endif /* __FAULT_H */
diff --git a/mm/Makefile b/mm/Makefile
index 3c1caa2..f647ff1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -8,7 +8,7 @@ KASAN_SANITIZE_slub.o := n
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= gup.o highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o pagewalk.o pgtable-generic.o
+			   vmalloc.o pagewalk.o pgtable-generic.o fault.o
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
diff --git a/mm/fault.c b/mm/fault.c
new file mode 100644
index 0000000..bfeee0b
--- /dev/null
+++ b/mm/fault.c
@@ -0,0 +1,171 @@
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/perf_event.h>
+#include <linux/module.h>
+
+#include <asm/fault.h>
+
+static noinline gpf_ret_t mm_fault_error(struct pt_regs *regs,
+					 unsigned long error_code,
+					 unsigned long address,
+					 unsigned int fault)
+{
+	if (fatal_signal_pending(current) && !fault_is_user(regs, error_code))
+		return handle_kernel_fault(regs, error_code, address, 0, 0);
+
+	if (fault & VM_FAULT_OOM) {
+		/* Kernel mode? Handle exceptions or die: */
+		if (!fault_is_user(regs, error_code))
+			return handle_kernel_fault(regs, error_code, address,
+						   SIGSEGV, SEGV_MAPERR);
+
+		/*
+		 * We ran out of memory, call the OOM killer, and return the
+		 * userspace (which will retry the fault, or kill us if we got
+		 * oom-killed):
+		 */
+		pagefault_out_of_memory();
+	} else {
+		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
+			     VM_FAULT_HWPOISON_LARGE))
+			return do_sigbus(regs, error_code, address, fault);
+		else if (fault & VM_FAULT_SIGSEGV)
+			return handle_bad_area(regs, error_code, address,
+					       SEGV_MAPERR);
+		else
+			BUG();
+	}
+	return FAULT_NO_ERR;
+}
+
+gpf_ret_t generic_page_fault(struct pt_regs *regs, struct task_struct *tsk,
+			     unsigned long error_code, unsigned long address)
+{
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int fault, major = 0;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+
+	mm = tsk->mm;
+
+	if (fault_is_user(regs, error_code))
+		flags |= FAULT_FLAG_USER;
+
+	if (fault_is_write(regs, error_code))
+		flags |= FAULT_FLAG_WRITE;
+
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
+	/*
+	 * When running in the kernel we expect faults to occur only to
+	 * addresses in user space.  All other faults represent errors in
+	 * the kernel and should generate an OOPS.  Unfortunately, in the
+	 * case of an erroneous fault occurring in a code path which already
+	 * holds mmap_sem we will deadlock attempting to validate the fault
+	 * against the address space.  Luckily the kernel only validly
+	 * references user space from well defined areas of code, which are
+	 * listed in the exceptions table.
+	 *
+	 * As the vast majority of faults will be valid we will only perform
+	 * the source reference check when there is a possibility of a
+	 * deadlock. Attempt to lock the address space, if we cannot we then
+	 * validate the source. If this is invalid we can skip the address
+	 * space check, thus avoiding the deadlock:
+	 */
+	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
+		if (!fault_is_user(regs, error_code) &&
+		    !search_exception_tables(GET_IP(regs))) {
+			return handle_bad_area(regs, error_code, address,
+					       SEGV_MAPERR);
+		}
+retry:
+		down_read(&mm->mmap_sem);
+	} else {
+		/*
+		 * The above down_read_trylock() might have succeeded in
+		 * which case we'll have missed the might_sleep() from
+		 * down_read():
+		 */
+		might_sleep();
+	}
+
+	vma = find_vma(mm, address);
+	if (unlikely(!vma))
+		goto bad_area;
+	if (likely(vma->vm_start <= address))
+		goto good_area;
+	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
+		goto bad_area;
+	if (unlikely(fault_is_user(regs, error_code) &&
+		     !stack_can_grow(regs, error_code, address, vma)))
+		goto bad_area;
+	if (unlikely(expand_stack(vma, address)))
+		goto bad_area;
+
+	/*
+	 * Ok, we have a good vm_area for this memory access, so
+	 * we can handle it..
+	 */
+good_area:
+	if (unlikely(access_error(regs, error_code, vma)))
+		goto bad_access;
+
+	/*
+	 * If for any reason at all we couldn't handle the fault,
+	 * make sure we exit gracefully rather than endlessly redo
+	 * the fault.  Since we never set FAULT_FLAG_RETRY_NOWAIT, if
+	 * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
+	 */
+	fault = handle_mm_fault(mm, vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
+
+	/*
+	 * If we need to retry the mmap_sem has already been released,
+	 * and if there is a fatal signal pending there is no guarantee
+	 * that we made any progress. Handle this case first.
+	 */
+	if (unlikely(fault & VM_FAULT_RETRY)) {
+		/* Retry at most once */
+		if (flags & FAULT_FLAG_ALLOW_RETRY) {
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			flags |= FAULT_FLAG_TRIED;
+			if (!fatal_signal_pending(tsk))
+				goto retry;
+		}
+
+		/* User mode? Just return to handle the fatal exception */
+		if (flags & FAULT_FLAG_USER)
+			return FAULT_NO_ERR;
+
+		/* Not returning to user mode? Handle exceptions or die: */
+		return handle_kernel_fault(regs, error_code, address,
+					   SIGBUS, BUS_ADRERR);
+	}
+
+	up_read(&mm->mmap_sem);
+	if (unlikely(fault & VM_FAULT_ERROR))
+		return mm_fault_error(regs, error_code, address, fault);
+
+	/*
+	 * Major/minor page fault accounting. If any of the events
+	 * returned VM_FAULT_MAJOR, we account it as a major fault.
+	 */
+	if (major) {
+		tsk->maj_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
+
+		/* Some archs want extra counting here */
+		arch_account_major_fault();
+	} else {
+		tsk->min_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
+	}
+	return FAULT_NO_ERR;
+
+ bad_area:
+	up_read(&mm->mmap_sem);
+	return handle_bad_area(regs, error_code, address, SEGV_MAPERR);	
+ bad_access:
+	up_read(&mm->mmap_sem);
+	return handle_bad_area(regs, error_code, address, SEGV_ACCERR);	
+}
-- 
2.1.0


[-- Attachment #3: 0002-powerpc-Use-generic_page_fault.patch --]
[-- Type: text/x-patch, Size: 19877 bytes --]

From 1c8e7e2ef295d6325796fcf3ce6f8825ffa7f58b Mon Sep 17 00:00:00 2001
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Sat, 28 Feb 2015 17:38:48 +1100
Subject: [PATCH 2/2] powerpc: Use generic_page_fault()

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/include/asm/fault.h | 165 ++++++++++++++++++++
 arch/powerpc/mm/fault.c          | 328 ++++++---------------------------------
 2 files changed, 215 insertions(+), 278 deletions(-)
 create mode 100644 arch/powerpc/include/asm/fault.h

diff --git a/arch/powerpc/include/asm/fault.h b/arch/powerpc/include/asm/fault.h
new file mode 100644
index 0000000..ebb46b9
--- /dev/null
+++ b/arch/powerpc/include/asm/fault.h
@@ -0,0 +1,165 @@
+#ifndef _ASM_POWERPC_FAULT_H
+#define _ASM_POWERPC_FAULT_H
+
+#include <linux/types.h>
+#include <linux/bug.h>
+
+#include <asm/ptrace.h>
+#include <asm/reg.h>
+#include <asm/firmware.h>
+#include <asm/paca.h>
+
+static inline bool fault_is_user(struct pt_regs *regs, unsigned long err_code)
+{
+	return user_mode(regs);
+}
+
+static inline bool fault_is_write(struct pt_regs *regs, unsigned long err_code)
+{
+#if !(defined(CONFIG_4xx) || defined(CONFIG_BOOKE))
+	return !!(err_code & DSISR_ISSTORE);
+#else
+	return !!(err_code & ESR_DST);
+#endif /* CONFIG_4xx || CONFIG_BOOKE */
+}
+
+/* We need to pass a couple of flags throug the generic page fault
+ * code via "error_code" which contains either the DSISR or the ESR
+ * content depending on the CPU family.
+ *
+ * We hijack bits that we don't use in either
+ */
+#define PF_CAN_GROW_STACK	0x00000001ul
+#define PF_EXEC			0x00000002ul
+
+/* Return type for do_page_fault */
+typedef int gpf_ret_t;
+
+#define FAULT_NO_ERR	0
+
+/* Check if the stack is allowed to grow during a user page fault */
+static inline bool stack_can_grow(struct pt_regs *regs, unsigned long err_code,
+				  unsigned long address,
+				  struct vm_area_struct *vma)
+{
+	/*
+	 * N.B. The POWER/Open ABI allows programs to access up to
+	 * 288 bytes below the stack pointer.
+	 * The kernel signal delivery code writes up to about 1.5kB
+	 * below the stack pointer (r1) before decrementing it.
+	 * The exec code can write slightly over 640kB to the stack
+	 * before setting the user r1.  Thus we allow the stack to
+	 * expand to 1MB without further checks.
+	 */
+	if (address + 0x100000 < vma->vm_end) {
+		/* get user regs even if this fault is in kernel mode */
+		struct pt_regs *uregs = current->thread.regs;
+		if (uregs == NULL)
+			return false;
+
+		/*
+		 * A user-mode access to an address a long way below
+		 * the stack pointer is only valid if the instruction
+		 * is one which would update the stack pointer to the
+		 * address accessed if the instruction completed,
+		 * i.e. either stwu rs,n(r1) or stwux rs,r1,rb
+		 * (or the byte, halfword, float or double forms).
+		 *
+		 * If we don't check this then any write to the area
+		 * between the last mapped region and the stack will
+		 * expand the stack rather than segfaulting.
+		 */
+		if (address + 2048 < uregs->gpr[1] &&
+		    !(err_code & PF_CAN_GROW_STACK))
+			return false;
+	}
+	return true;
+}
+
+static inline bool access_error(struct pt_regs *regs, unsigned long err_code,
+				struct vm_area_struct *vma)
+{
+#if defined(CONFIG_6xx)
+	/* an error such as lwarx to I/O controller space,
+	   address matching DABR, eciwx, etc. */
+	if (err_code & 0x95700000)
+		return true;
+#endif /* CONFIG_6xx */
+#if defined(CONFIG_8xx)
+        /* The MPC8xx seems to always set 0x80000000, which is
+         * "undefined".  Of those that can be set, this is the only
+         * one which seems bad.
+         */
+	if (err_code & 0x10000000)
+                /* Guarded storage error. */
+		return true;
+#endif /* CONFIG_8xx */
+
+	if (err_code & PF_EXEC) {
+		/*
+		 * Allow execution from readable areas if the MMU does not
+		 * provide separate controls over reading and executing.
+		 *
+		 * Note: That code used to not be enabled for 4xx/BookE.
+		 * It is now as I/D cache coherency for these is done at
+		 * set_pte_at() time and I see no reason why the test
+		 * below wouldn't be valid on those processors. This -may-
+		 * break programs compiled with a really old ABI though.
+		 */
+		if (!(vma->vm_flags & VM_EXEC) &&
+		    (cpu_has_feature(CPU_FTR_NOEXECUTE) ||
+		     !(vma->vm_flags & (VM_READ | VM_WRITE))))
+			return true;
+#ifdef CONFIG_PPC_STD_MMU
+		/*
+		 * protfault should only happen due to us
+		 * mapping a region readonly temporarily. PROT_NONE
+		 * is also covered by the VMA check above.
+		 */
+		WARN_ON_ONCE(err_code & DSISR_PROTFAULT);
+#endif /* CONFIG_PPC_STD_MMU */
+	/* a write */
+	} else if (fault_is_write(regs, err_code)) {
+		if (!(vma->vm_flags & VM_WRITE))
+			return true;
+	/* a read */
+	} else {
+		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
+			return true;
+		WARN_ON_ONCE(err_code & DSISR_PROTFAULT);
+	}
+	return false;
+}
+
+/* Error handlers */
+
+gpf_ret_t handle_bad_area(struct pt_regs *regs, unsigned long error_code,
+			  unsigned long address, int si_code);
+
+static inline gpf_ret_t handle_kernel_fault(struct pt_regs *regs,
+					    unsigned long error_code,
+					    unsigned long address, int sig,
+					    int si_code)
+{
+	return sig;
+}
+
+gpf_ret_t do_sigbus(struct pt_regs *regs, unsigned long error_code,
+		    unsigned long address, unsigned int fault);
+
+static inline void arch_account_major_fault(void)
+{
+#ifdef CONFIG_PPC_SMLPAR
+	if (firmware_has_feature(FW_FEATURE_CMO)) {
+		u32 page_ins;
+
+		preempt_disable();
+		page_ins = be32_to_cpu(get_lppaca()->page_ins);
+		page_ins += 1 << PAGE_FACTOR;
+		get_lppaca()->page_ins = cpu_to_be32(page_ins);
+		preempt_enable();
+	}
+#endif /* CONFIG_PPC_SMLPAR */
+}
+
+#endif /*  _ASM_X86_FAULT_H */
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index b396868..c51c156 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -33,6 +33,7 @@
 #include <linux/ratelimit.h>
 #include <linux/context_tracking.h>
 #include <linux/hugetlb.h>
+#include <linux/fault.h>
 
 #include <asm/firmware.h>
 #include <asm/page.h>
@@ -72,15 +73,15 @@ static inline int notify_page_fault(struct pt_regs *regs)
  * Check whether the instruction at regs->nip is a store using
  * an update addressing form which will update r1.
  */
-static int store_updates_sp(struct pt_regs *regs)
+static bool store_updates_sp(struct pt_regs *regs)
 {
 	unsigned int inst;
 
 	if (get_user(inst, (unsigned int __user *)regs->nip))
-		return 0;
+		return false;
 	/* check for 1 in the rA field */
 	if (((inst >> 16) & 0x1f) != 1)
-		return 0;
+		return false;
 	/* check major opcode */
 	switch (inst >> 26) {
 	case 37:	/* stwu */
@@ -88,7 +89,7 @@ static int store_updates_sp(struct pt_regs *regs)
 	case 45:	/* sthu */
 	case 53:	/* stfsu */
 	case 55:	/* stfdu */
-		return 1;
+		return true;
 	case 62:	/* std or stdu */
 		return (inst & 3) == 1;
 	case 31:
@@ -100,10 +101,10 @@ static int store_updates_sp(struct pt_regs *regs)
 		case 439:	/* sthux */
 		case 695:	/* stfsux */
 		case 759:	/* stfdux */
-			return 1;
+			return true;
 		}
 	}
-	return 0;
+	return false;
 }
 /*
  * do_page_fault error handling helpers
@@ -113,16 +114,14 @@ static int store_updates_sp(struct pt_regs *regs)
 #define MM_FAULT_CONTINUE	-1
 #define MM_FAULT_ERR(sig)	(sig)
 
-static int do_sigbus(struct pt_regs *regs, unsigned long address,
-		     unsigned int fault)
+gpf_ret_t do_sigbus(struct pt_regs *regs, unsigned long error_code,
+		    unsigned long address, unsigned int fault)
 {
 	siginfo_t info;
 	unsigned int lsb = 0;
 
-	up_read(&current->mm->mmap_sem);
-
 	if (!user_mode(regs))
-		return MM_FAULT_ERR(SIGBUS);
+		return SIGBUS;
 
 	current->thread.trap_nr = BUS_ADRERR;
 	info.si_signo = SIGBUS;
@@ -143,53 +142,25 @@ static int do_sigbus(struct pt_regs *regs, unsigned long address,
 #endif
 	info.si_addr_lsb = lsb;
 	force_sig_info(SIGBUS, &info, current);
-	return MM_FAULT_RETURN;
+	return 0;
 }
 
-static int mm_fault_error(struct pt_regs *regs, unsigned long addr, int fault)
+gpf_ret_t handle_bad_area(struct pt_regs *regs, unsigned long error_code,
+			  unsigned long address, int si_code)
 {
-	/*
-	 * Pagefault was interrupted by SIGKILL. We have no reason to
-	 * continue the pagefault.
-	 */
-	if (fatal_signal_pending(current)) {
-		/*
-		 * If we have retry set, the mmap semaphore will have
-		 * alrady been released in __lock_page_or_retry(). Else
-		 * we release it now.
-		 */
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
-		/* Coming from kernel, we need to deal with uaccess fixups */
-		if (user_mode(regs))
-			return MM_FAULT_RETURN;
-		return MM_FAULT_ERR(SIGKILL);
-	}
 
-	/* No fault: be happy */
-	if (!(fault & VM_FAULT_ERROR))
-		return MM_FAULT_CONTINUE;
-
-	/* Out of memory */
-	if (fault & VM_FAULT_OOM) {
-		up_read(&current->mm->mmap_sem);
-
-		/*
-		 * We ran out of memory, or some other thing happened to us that
-		 * made us unable to handle the page fault gracefully.
-		 */
-		if (!user_mode(regs))
-			return MM_FAULT_ERR(SIGKILL);
-		pagefault_out_of_memory();
-		return MM_FAULT_RETURN;
+	/* User mode accesses cause a SIGSEGV */
+	if (user_mode(regs)) {
+		_exception(SIGSEGV, regs, si_code, address);
+		return 0;
 	}
 
-	if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE))
-		return do_sigbus(regs, addr, fault);
+	if ((error_code & PF_EXEC) && (error_code & DSISR_PROTFAULT))
+		printk_ratelimited(KERN_CRIT "kernel tried to execute NX-protected"
+				   " page (%lx) - exploit attempt? (uid: %d)\n",
+				   address, from_kuid(&init_user_ns, current_uid()));
 
-	/* We don't understand the fault code, this is fatal */
-	BUG();
-	return MM_FAULT_CONTINUE;
+	return SIGSEGV;
 }
 
 /*
@@ -205,19 +176,11 @@ static int mm_fault_error(struct pt_regs *regs, unsigned long addr, int fault)
  * The return value is 0 if the fault was handled, or the signal
  * number if this is a kernel fault that can't be handled here.
  */
-int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
-			    unsigned long error_code)
+static int __do_page_fault(struct pt_regs *regs, unsigned long address,
+			   unsigned long error_code)
 {
-	enum ctx_state prev_state = exception_enter();
-	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
-	int code = SEGV_MAPERR;
-	int is_write = 0;
 	int trap = TRAP(regs);
- 	int is_exec = trap == 0x400;
-	int fault;
-	int rc = 0, store_update_sp = 0;
 
 #if !(defined(CONFIG_4xx) || defined(CONFIG_BOOKE))
 	/*
@@ -228,10 +191,6 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
 	 */
 	if (trap == 0x400)
 		error_code &= 0x48200000;
-	else
-		is_write = error_code & DSISR_ISSTORE;
-#else
-	is_write = error_code & ESR_DST;
 #endif /* CONFIG_4xx || CONFIG_BOOKE */
 
 #ifdef CONFIG_PPC_ICSWX
@@ -241,30 +200,28 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
 	 * look at it
 	 */
 	if (error_code & ICSWX_DSI_UCT) {
-		rc = acop_handle_fault(regs, address, error_code);
+		gfp_ret_t rc = acop_handle_fault(regs, address, error_code);
 		if (rc)
-			goto bail;
+			return rc;
 	}
 #endif /* CONFIG_PPC_ICSWX */
 
 	if (notify_page_fault(regs))
-		goto bail;
+		return 0;
 
 	if (unlikely(debugger_fault_handler(regs)))
-		goto bail;
+		return 0;
 
 	/* On a kernel SLB miss we can only check for a valid exception entry */
-	if (!user_mode(regs) && (address >= TASK_SIZE)) {
-		rc = SIGSEGV;
-		goto bail;
-	}
+	if (!user_mode(regs) && (address >= TASK_SIZE))
+		return SIGSEGV;
 
 #if !(defined(CONFIG_4xx) || defined(CONFIG_BOOKE) || \
 			     defined(CONFIG_PPC_BOOK3S_64))
   	if (error_code & DSISR_DABRMATCH) {
 		/* breakpoint match */
 		do_break(regs, address, error_code);
-		goto bail;
+		return 0;
 	}
 #endif
 
@@ -273,10 +230,9 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
 		local_irq_enable();
 
 	if (in_atomic() || mm == NULL) {
-		if (!user_mode(regs)) {
-			rc = SIGSEGV;
-			goto bail;
-		}
+		if (!user_mode(regs))
+			return SIGSEGV;
+
 		/* in_atomic() in user mode is really bad,
 		   as is current->mm == NULL. */
 		printk(KERN_EMERG "Page fault in user mode with "
@@ -286,220 +242,36 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
 		die("Weird page fault", regs, SIGSEGV);
 	}
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+	error_code &= ~(PF_CAN_GROW_STACK | PF_EXEC);
 
 	/*
 	 * We want to do this outside mmap_sem, because reading code around nip
 	 * can result in fault, which will cause a deadlock when called with
 	 * mmap_sem held
 	 */
-	if (user_mode(regs))
-		store_update_sp = store_updates_sp(regs);
-
-	if (user_mode(regs))
-		flags |= FAULT_FLAG_USER;
-
-	/* When running in the kernel we expect faults to occur only to
-	 * addresses in user space.  All other faults represent errors in the
-	 * kernel and should generate an OOPS.  Unfortunately, in the case of an
-	 * erroneous fault occurring in a code path which already holds mmap_sem
-	 * we will deadlock attempting to validate the fault against the
-	 * address space.  Luckily the kernel only validly references user
-	 * space from well defined areas of code, which are listed in the
-	 * exceptions table.
-	 *
-	 * As the vast majority of faults will be valid we will only perform
-	 * the source reference check when there is a possibility of a deadlock.
-	 * Attempt to lock the address space, if we cannot we then validate the
-	 * source.  If this is invalid we can skip the address space check,
-	 * thus avoiding the deadlock.
-	 */
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		if (!user_mode(regs) && !search_exception_tables(regs->nip))
-			goto bad_area_nosemaphore;
-
-retry:
-		down_read(&mm->mmap_sem);
-	} else {
-		/*
-		 * The above down_read_trylock() might have succeeded in
-		 * which case we'll have missed the might_sleep() from
-		 * down_read():
-		 */
-		might_sleep();
-	}
-
-	vma = find_vma(mm, address);
-	if (!vma)
-		goto bad_area;
-	if (vma->vm_start <= address)
-		goto good_area;
-	if (!(vma->vm_flags & VM_GROWSDOWN))
-		goto bad_area;
-
-	/*
-	 * N.B. The POWER/Open ABI allows programs to access up to
-	 * 288 bytes below the stack pointer.
-	 * The kernel signal delivery code writes up to about 1.5kB
-	 * below the stack pointer (r1) before decrementing it.
-	 * The exec code can write slightly over 640kB to the stack
-	 * before setting the user r1.  Thus we allow the stack to
-	 * expand to 1MB without further checks.
-	 */
-	if (address + 0x100000 < vma->vm_end) {
-		/* get user regs even if this fault is in kernel mode */
-		struct pt_regs *uregs = current->thread.regs;
-		if (uregs == NULL)
-			goto bad_area;
-
-		/*
-		 * A user-mode access to an address a long way below
-		 * the stack pointer is only valid if the instruction
-		 * is one which would update the stack pointer to the
-		 * address accessed if the instruction completed,
-		 * i.e. either stwu rs,n(r1) or stwux rs,r1,rb
-		 * (or the byte, halfword, float or double forms).
-		 *
-		 * If we don't check this then any write to the area
-		 * between the last mapped region and the stack will
-		 * expand the stack rather than segfaulting.
-		 */
-		if (address + 2048 < uregs->gpr[1] && !store_update_sp)
-			goto bad_area;
-	}
-	if (expand_stack(vma, address))
-		goto bad_area;
-
-good_area:
-	code = SEGV_ACCERR;
-#if defined(CONFIG_6xx)
-	if (error_code & 0x95700000)
-		/* an error such as lwarx to I/O controller space,
-		   address matching DABR, eciwx, etc. */
-		goto bad_area;
-#endif /* CONFIG_6xx */
-#if defined(CONFIG_8xx)
-        /* The MPC8xx seems to always set 0x80000000, which is
-         * "undefined".  Of those that can be set, this is the only
-         * one which seems bad.
-         */
-	if (error_code & 0x10000000)
-                /* Guarded storage error. */
-		goto bad_area;
-#endif /* CONFIG_8xx */
-
-	if (is_exec) {
-		/*
-		 * Allow execution from readable areas if the MMU does not
-		 * provide separate controls over reading and executing.
-		 *
-		 * Note: That code used to not be enabled for 4xx/BookE.
-		 * It is now as I/D cache coherency for these is done at
-		 * set_pte_at() time and I see no reason why the test
-		 * below wouldn't be valid on those processors. This -may-
-		 * break programs compiled with a really old ABI though.
-		 */
-		if (!(vma->vm_flags & VM_EXEC) &&
-		    (cpu_has_feature(CPU_FTR_NOEXECUTE) ||
-		     !(vma->vm_flags & (VM_READ | VM_WRITE))))
-			goto bad_area;
-#ifdef CONFIG_PPC_STD_MMU
-		/*
-		 * protfault should only happen due to us
-		 * mapping a region readonly temporarily. PROT_NONE
-		 * is also covered by the VMA check above.
-		 */
-		WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
-#endif /* CONFIG_PPC_STD_MMU */
-	/* a write */
-	} else if (is_write) {
-		if (!(vma->vm_flags & VM_WRITE))
-			goto bad_area;
-		flags |= FAULT_FLAG_WRITE;
-	/* a read */
-	} else {
-		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
-			goto bad_area;
-		WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
-	}
-
-	/*
-	 * If for any reason at all we couldn't handle the fault,
-	 * make sure we exit gracefully rather than endlessly redo
-	 * the fault.
-	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
-	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
-		if (fault & VM_FAULT_SIGSEGV)
-			goto bad_area;
-		rc = mm_fault_error(regs, address, fault);
-		if (rc >= MM_FAULT_RETURN)
-			goto bail;
-		else
-			rc = 0;
-	}
-
-	/*
-	 * Major/minor page fault accounting is only done on the
-	 * initial attempt. If we go through a retry, it is extremely
-	 * likely that the page will be found in page cache at that point.
-	 */
-	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			current->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
-				      regs, address);
-#ifdef CONFIG_PPC_SMLPAR
-			if (firmware_has_feature(FW_FEATURE_CMO)) {
-				u32 page_ins;
-
-				preempt_disable();
-				page_ins = be32_to_cpu(get_lppaca()->page_ins);
-				page_ins += 1 << PAGE_FACTOR;
-				get_lppaca()->page_ins = cpu_to_be32(page_ins);
-				preempt_enable();
-			}
-#endif /* CONFIG_PPC_SMLPAR */
-		} else {
-			current->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
-				      regs, address);
-		}
-		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-			flags |= FAULT_FLAG_TRIED;
-			goto retry;
-		}
-	}
-
-	up_read(&mm->mmap_sem);
-	goto bail;
+	if (user_mode(regs) && store_updates_sp(regs))
+		error_code |= PF_CAN_GROW_STACK;
 
-bad_area:
-	up_read(&mm->mmap_sem);
-
-bad_area_nosemaphore:
-	/* User mode accesses cause a SIGSEGV */
-	if (user_mode(regs)) {
-		_exception(SIGSEGV, regs, code, address);
-		goto bail;
-	}
+	/* Set flag if exec fault for use by access_error */
+	if (trap == 0x400)
+		error_code |= PF_EXEC;
 
-	if (is_exec && (error_code & DSISR_PROTFAULT))
-		printk_ratelimited(KERN_CRIT "kernel tried to execute NX-protected"
-				   " page (%lx) - exploit attempt? (uid: %d)\n",
-				   address, from_kuid(&init_user_ns, current_uid()));
+	/* Generic page fault */
+	return generic_page_fault(regs, current, error_code, address);
+}
 
-	rc = SIGSEGV;
+int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
+			    unsigned long error_code)
+{
+	enum ctx_state prev_state = exception_enter();
+	int rc;
 
-bail:
+	rc = __do_page_fault(regs, address, error_code);
 	exception_exit(prev_state);
 	return rc;
-
 }
 
+
 /*
  * bad_page_fault is called when we have a bad access from the kernel.
  * It is called from the DSI and ISI handlers in head.S and from some
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-02-28  7:12                     ` Generic page fault (Was: libsigsegv ....) Benjamin Herrenschmidt
@ 2015-02-28  7:14                         ` Benjamin Herrenschmidt
  2015-02-28 19:56                         ` Linus Torvalds
  1 sibling, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28  7:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, 2015-02-28 at 18:12 +1100, Benjamin Herrenschmidt wrote:
> 
> Let me know what you think of the approach. It's boot tested on x86_64
> in qemu and 

 ... and powerpc64 LE on qemu as well :-)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-02-28  7:14                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28  7:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, 2015-02-28 at 18:12 +1100, Benjamin Herrenschmidt wrote:
> 
> Let me know what you think of the approach. It's boot tested on x86_64
> in qemu and 

 ... and powerpc64 LE on qemu as well :-)

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-02-28  7:14                         ` Benjamin Herrenschmidt
@ 2015-02-28 10:36                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 10:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, 2015-02-28 at 18:14 +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2015-02-28 at 18:12 +1100, Benjamin Herrenschmidt wrote:
> > 
> > Let me know what you think of the approach. It's boot tested on x86_64
> > in qemu and 
> 
>  ... and powerpc64 LE on qemu as well :-)

One thing I'd like to do is fold handle_kernel_fault() into
handle_bad_area() (and in fact fold do_sigbus as well).

Basically have a single handle_bad_fault() that takes sig and si_code as
arguments which we call from the generic code for all faults. It will
test for kernel vs. user mode and do the right thing (and we could
handle the sigbus special case by simply comparing sig to sigbus).

The one reason I haven't done it so far is that x86 handle_bad_area()
has the is_f00f_bug() call which isn't do for other cases of
no_context() and I'm not sure generalizing it is safe for all cases (or
maybe I can call it only when sig is SIGSEGV ?) ... I don't actually
understand what it does :)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-02-28 10:36                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 10:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, 2015-02-28 at 18:14 +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2015-02-28 at 18:12 +1100, Benjamin Herrenschmidt wrote:
> > 
> > Let me know what you think of the approach. It's boot tested on x86_64
> > in qemu and 
> 
>  ... and powerpc64 LE on qemu as well :-)

One thing I'd like to do is fold handle_kernel_fault() into
handle_bad_area() (and in fact fold do_sigbus as well).

Basically have a single handle_bad_fault() that takes sig and si_code as
arguments which we call from the generic code for all faults. It will
test for kernel vs. user mode and do the right thing (and we could
handle the sigbus special case by simply comparing sig to sigbus).

The one reason I haven't done it so far is that x86 handle_bad_area()
has the is_f00f_bug() call which isn't do for other cases of
no_context() and I'm not sure generalizing it is safe for all cases (or
maybe I can call it only when sig is SIGSEGV ?) ... I don't actually
understand what it does :)

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-02-28  7:12                     ` Generic page fault (Was: libsigsegv ....) Benjamin Herrenschmidt
@ 2015-02-28 19:56                         ` Linus Torvalds
  2015-02-28 19:56                         ` Linus Torvalds
  1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-02-28 19:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-arch, linux-kernel, linux-mm

On Fri, Feb 27, 2015 at 11:12 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> Let me know what you think of the approach.

Hmm. I'm not happy with just how many of those arch wrapper/helper
functions there are, and some of it looks a bit unportable.

For example, the code "knows" that "err_code" and "address" are the
only two architecture-specific pieces of information (in addition to
"struct pt_regs", of course.

And the fault_is_user/write() stuff seems unnecessary - we have the
generic FAULT_FLAG_USER/WRITE flags for that, but instead of passing
in the generic version to the generic function, we pass in the
arch-specific ones.

The same goes for "access_error()". Right now it's an arch-specific
helper function, but it actually does some generic tests (like
checking the vma protections), and I think it could/should be entirely
generic if we just added the necessary FAULT_FLAG_READ/EXEC/NOTPRESENT
information to the "flags" register. Just looking at the ppc version,
my gut feel is that "access_error()" is just horribly error-prone
as-is even from an architecture standpoint.

Yes, the "read/exec/notpresent" bits are a bit weird, but old x86
isn't the only architecture that doesn't necessarily know the
difference between read and exec.

So I'd like a bit more encapsulation. The generic code should never
really need to look at the arch-specific details, although it is true
that then the error handling cases will likely need it (in order to
put it on the arch-specific stack info or similar).

So my *gut* feel is that the generic code should be passed

 - address and the generic flags (ie FAULT_FLAG_USER/WRITE filled in
by the caller)

   These are the only things the generic code should need to use itself

 - I guess we do need to pass in "struct pt_regs", because things like
generic tracing actually want it.

 - an opaque "arch specific fault information structure pointer". Not
used for anything but passing back to the error functions (ie very
much *not* for helper functions for the normal case, like the current
"access_error()" - if it's actually needed for those kinds of things,
then I'm wrong about the model)

   This would (for x86) contain "error_code", but I have the feeling
that other architectures might need/want more than one word of
information.

But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
just have this feeling that it coudl be more "generic", and less
"random small arch helpers".

                              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-02-28 19:56                         ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-02-28 19:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-arch, linux-kernel, linux-mm

On Fri, Feb 27, 2015 at 11:12 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> Let me know what you think of the approach.

Hmm. I'm not happy with just how many of those arch wrapper/helper
functions there are, and some of it looks a bit unportable.

For example, the code "knows" that "err_code" and "address" are the
only two architecture-specific pieces of information (in addition to
"struct pt_regs", of course.

And the fault_is_user/write() stuff seems unnecessary - we have the
generic FAULT_FLAG_USER/WRITE flags for that, but instead of passing
in the generic version to the generic function, we pass in the
arch-specific ones.

The same goes for "access_error()". Right now it's an arch-specific
helper function, but it actually does some generic tests (like
checking the vma protections), and I think it could/should be entirely
generic if we just added the necessary FAULT_FLAG_READ/EXEC/NOTPRESENT
information to the "flags" register. Just looking at the ppc version,
my gut feel is that "access_error()" is just horribly error-prone
as-is even from an architecture standpoint.

Yes, the "read/exec/notpresent" bits are a bit weird, but old x86
isn't the only architecture that doesn't necessarily know the
difference between read and exec.

So I'd like a bit more encapsulation. The generic code should never
really need to look at the arch-specific details, although it is true
that then the error handling cases will likely need it (in order to
put it on the arch-specific stack info or similar).

So my *gut* feel is that the generic code should be passed

 - address and the generic flags (ie FAULT_FLAG_USER/WRITE filled in
by the caller)

   These are the only things the generic code should need to use itself

 - I guess we do need to pass in "struct pt_regs", because things like
generic tracing actually want it.

 - an opaque "arch specific fault information structure pointer". Not
used for anything but passing back to the error functions (ie very
much *not* for helper functions for the normal case, like the current
"access_error()" - if it's actually needed for those kinds of things,
then I'm wrong about the model)

   This would (for x86) contain "error_code", but I have the feeling
that other architectures might need/want more than one word of
information.

But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
just have this feeling that it coudl be more "generic", and less
"random small arch helpers".

                              Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-02-28 19:56                         ` Linus Torvalds
@ 2015-02-28 19:58                           ` Linus Torvalds
  -1 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-02-28 19:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, Feb 28, 2015 at 11:56 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
> just have this feeling that it coudl be more "generic", and less
> "random small arch helpers".

Oh, and I definitely agree with you on the "single handle_bad_fault()"
thing rather than the current
handle_bad_area/handle_kernel_fault/do_sigbus.

                 Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-02-28 19:58                           ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-02-28 19:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, Feb 28, 2015 at 11:56 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
> just have this feeling that it coudl be more "generic", and less
> "random small arch helpers".

Oh, and I definitely agree with you on the "single handle_bad_fault()"
thing rather than the current
handle_bad_area/handle_kernel_fault/do_sigbus.

                 Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-02-28 19:56                         ` Linus Torvalds
@ 2015-02-28 21:14                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 21:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, 2015-02-28 at 11:56 -0800, Linus Torvalds wrote:
> On Fri, Feb 27, 2015 at 11:12 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > Let me know what you think of the approach.
> 
> Hmm. I'm not happy with just how many of those arch wrapper/helper
> functions there are, and some of it looks a bit unportable.
> 
> For example, the code "knows" that "err_code" and "address" are the
> only two architecture-specific pieces of information (in addition to
> "struct pt_regs", of course.
> 
> And the fault_is_user/write() stuff seems unnecessary - we have the
> generic FAULT_FLAG_USER/WRITE flags for that, but instead of passing
> in the generic version to the generic function, we pass in the
> arch-specific ones.

I was trying to keep that fault mask local but it might indeed be a
mistake.

The reasoning is that if you look at more archs (the scary one is
sparc), there is potentially more arch "gunk" that needs to be hooked in
after find_vma()...

And a lot of that need arch specific flags coming in. For example on ppc
I check if the instruction is allowed to grow the stack before we take
the mm sem and pass that information for use by the stack growing check.

Here I could just I suppose add some arch specific fault flags but it
might get messy... 

> The same goes for "access_error()". Right now it's an arch-specific
> helper function, but it actually does some generic tests (like
> checking the vma protections), and I think it could/should be entirely
> generic if we just added the necessary FAULT_FLAG_READ/EXEC/NOTPRESENT
> information to the "flags" register. Just looking at the ppc version,
> my gut feel is that "access_error()" is just horribly error-prone
> as-is even from an architecture standpoint.

I was weary of going down the path of having access_error() be an arch
helper because it gets even wilder with other archs and the crazy
platform flags we deal with on ppc. Here. I was thinking of looking at
factoring that in a second step. You are right that most of it seems
to related to execute permission however (at least sparc, mips, powerpc
and arm).

The way to go would be to define a FAULT_FLAG_EXEC which the arch only
sets if it supports enforcing exec at fault time.

BTW. I fail to see how x86 checks PF_INSTR vs. VM_NOEXEC ... or it doesn't ?

> Yes, the "read/exec/notpresent" bits are a bit weird, but old x86
> isn't the only architecture that doesn't necessarily know the
> difference between read and exec.
> 
> So I'd like a bit more encapsulation. The generic code should never
> really need to look at the arch-specific details, although it is true
> that then the error handling cases will likely need it (in order to
> put it on the arch-specific stack info or similar).

Right, we might still have to pass error_code around, I'll have a look.

> So my *gut* feel is that the generic code should be passed
> 
>  - address and the generic flags (ie FAULT_FLAG_USER/WRITE filled in
> by the caller)
> 
>    These are the only things the generic code should need to use itself
>
>  - I guess we do need to pass in "struct pt_regs", because things like
> generic tracing actually want it.

And error handling but it could be in the latter...

>  - an opaque "arch specific fault information structure pointer". Not
> used for anything but passing back to the error functions (ie very
> much *not* for helper functions for the normal case, like the current
> "access_error()" - if it's actually needed for those kinds of things,
> then I'm wrong about the model)

Well, it might be needed for stack growth check. It's also somewhat
means we have yet *another* fault information structure which is a bit
sad but I don't see a good way to fold them into one.

For errors I was thinking instead of returning error info from the
generic code and leaving the generation of oops/signal to the caller,
which means less arch gunk to carry around.

>    This would (for x86) contain "error_code", but I have the feeling
> that other architectures might need/want more than one word of
> information.
> 
> But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
> just have this feeling that it coudl be more "generic", and less
> "random small arch helpers".

I don't like it that much either, it's a first draft to see what the
waters are like.

I'll play around more and see what It comes to.

Cheers,
Ben.

> 
>                               Linus
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-02-28 21:14                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 21:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, 2015-02-28 at 11:56 -0800, Linus Torvalds wrote:
> On Fri, Feb 27, 2015 at 11:12 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > Let me know what you think of the approach.
> 
> Hmm. I'm not happy with just how many of those arch wrapper/helper
> functions there are, and some of it looks a bit unportable.
> 
> For example, the code "knows" that "err_code" and "address" are the
> only two architecture-specific pieces of information (in addition to
> "struct pt_regs", of course.
> 
> And the fault_is_user/write() stuff seems unnecessary - we have the
> generic FAULT_FLAG_USER/WRITE flags for that, but instead of passing
> in the generic version to the generic function, we pass in the
> arch-specific ones.

I was trying to keep that fault mask local but it might indeed be a
mistake.

The reasoning is that if you look at more archs (the scary one is
sparc), there is potentially more arch "gunk" that needs to be hooked in
after find_vma()...

And a lot of that need arch specific flags coming in. For example on ppc
I check if the instruction is allowed to grow the stack before we take
the mm sem and pass that information for use by the stack growing check.

Here I could just I suppose add some arch specific fault flags but it
might get messy... 

> The same goes for "access_error()". Right now it's an arch-specific
> helper function, but it actually does some generic tests (like
> checking the vma protections), and I think it could/should be entirely
> generic if we just added the necessary FAULT_FLAG_READ/EXEC/NOTPRESENT
> information to the "flags" register. Just looking at the ppc version,
> my gut feel is that "access_error()" is just horribly error-prone
> as-is even from an architecture standpoint.

I was weary of going down the path of having access_error() be an arch
helper because it gets even wilder with other archs and the crazy
platform flags we deal with on ppc. Here. I was thinking of looking at
factoring that in a second step. You are right that most of it seems
to related to execute permission however (at least sparc, mips, powerpc
and arm).

The way to go would be to define a FAULT_FLAG_EXEC which the arch only
sets if it supports enforcing exec at fault time.

BTW. I fail to see how x86 checks PF_INSTR vs. VM_NOEXEC ... or it doesn't ?

> Yes, the "read/exec/notpresent" bits are a bit weird, but old x86
> isn't the only architecture that doesn't necessarily know the
> difference between read and exec.
> 
> So I'd like a bit more encapsulation. The generic code should never
> really need to look at the arch-specific details, although it is true
> that then the error handling cases will likely need it (in order to
> put it on the arch-specific stack info or similar).

Right, we might still have to pass error_code around, I'll have a look.

> So my *gut* feel is that the generic code should be passed
> 
>  - address and the generic flags (ie FAULT_FLAG_USER/WRITE filled in
> by the caller)
> 
>    These are the only things the generic code should need to use itself
>
>  - I guess we do need to pass in "struct pt_regs", because things like
> generic tracing actually want it.

And error handling but it could be in the latter...

>  - an opaque "arch specific fault information structure pointer". Not
> used for anything but passing back to the error functions (ie very
> much *not* for helper functions for the normal case, like the current
> "access_error()" - if it's actually needed for those kinds of things,
> then I'm wrong about the model)

Well, it might be needed for stack growth check. It's also somewhat
means we have yet *another* fault information structure which is a bit
sad but I don't see a good way to fold them into one.

For errors I was thinking instead of returning error info from the
generic code and leaving the generation of oops/signal to the caller,
which means less arch gunk to carry around.

>    This would (for x86) contain "error_code", but I have the feeling
> that other architectures might need/want more than one word of
> information.
> 
> But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
> just have this feeling that it coudl be more "generic", and less
> "random small arch helpers".

I don't like it that much either, it's a first draft to see what the
waters are like.

I'll play around more and see what It comes to.

Cheers,
Ben.

> 
>                               Linus
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-02-28 21:14                           ` Benjamin Herrenschmidt
@ 2015-02-28 21:49                             ` Linus Torvalds
  -1 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-02-28 21:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, Feb 28, 2015 at 1:14 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> BTW. I fail to see how x86 checks PF_INSTR vs. VM_NOEXEC ... or it doesn't ?

It doesn't. x86 traditionally doesn't have an execute bit, so
traditionally "read == exec".

So PF_INSTR really wasn't historically very useful, in that it would
only show if the *first* access to a page was an instruction fetch -
if you did a regular read to brign the page in, then subsequent
instruction fetches would just work.

Then NX came along, and what happens now is

 - we handle write faults separately (see the first part of access_error()

 - so now we know it was a read or an instruction fetch

 - if PF_PROT is set, that means that the present bit was set in the
page tables, so it must have been an exec access to a NX page

 - otherwise, we just say "PROTNONE means no access, otherwise
populate the page tables"

.. and if it turns out that it was a PF_INSTR to a NX page, we'll end
up taking the page fault *again* after it's been populated, and now
since the page table was populated, the access_error() will catch it
with the PF_PROT case.

Or something like that. I might have screwed up some detail, but it
should all work.

                     Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-02-28 21:49                             ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-02-28 21:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, Feb 28, 2015 at 1:14 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> BTW. I fail to see how x86 checks PF_INSTR vs. VM_NOEXEC ... or it doesn't ?

It doesn't. x86 traditionally doesn't have an execute bit, so
traditionally "read == exec".

So PF_INSTR really wasn't historically very useful, in that it would
only show if the *first* access to a page was an instruction fetch -
if you did a regular read to brign the page in, then subsequent
instruction fetches would just work.

Then NX came along, and what happens now is

 - we handle write faults separately (see the first part of access_error()

 - so now we know it was a read or an instruction fetch

 - if PF_PROT is set, that means that the present bit was set in the
page tables, so it must have been an exec access to a NX page

 - otherwise, we just say "PROTNONE means no access, otherwise
populate the page tables"

.. and if it turns out that it was a PF_INSTR to a NX page, we'll end
up taking the page fault *again* after it's been populated, and now
since the page table was populated, the access_error() will catch it
with the PF_PROT case.

Or something like that. I might have screwed up some detail, but it
should all work.

                     Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-02-28 21:14                           ` Benjamin Herrenschmidt
@ 2015-02-28 22:16                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 22:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

So for error handling, I'm trying to simply return the VM_FAULT_* flags
from generic_page_fault see where that takes us. That's a way to avoid
passing an arch specific struct around. It also allows my hack to
account major faults with the hypervisor to be done outside the generic
code completely (no hook).

We will process generically some of the flags first such as the repeat
logic or major/minor accounting of course.

For that to work, I'm adding a VM_FAULT_ACCESS (that gets OR'ed with
VM_FAULT_SIGSEGV) to differentiate SEGV_MAPERR and SEGV_ACCERR. So far
so good.

However, I noticed a small discrepancy on x86 in the handling of fatal
signals:

I see two path that can be hit on a fatal signal. The "obvious"
one is the one in access_error() which calls no_context() with a 0
signal argument, the other path is in the retry handling, which will in
this case call no_context() with SIGBUS/BUS_ADRERR. 

Now, the only place in there that seems to care about the signal that
gets passed in is the sig_on_uaccess_error case. On one case (0 sig),
that test will be skipped, on the other case (SIGBUS), that test will be
done and might result in a sigbus being generated, which might override
the original deadly signal (or am I missing something ?)

Now I don't completely understand how the x86 vsyscall stuff works so I
don't know precisely in what circumstances that test matters, I'll need
you help there.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-02-28 22:16                             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 22:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

So for error handling, I'm trying to simply return the VM_FAULT_* flags
from generic_page_fault see where that takes us. That's a way to avoid
passing an arch specific struct around. It also allows my hack to
account major faults with the hypervisor to be done outside the generic
code completely (no hook).

We will process generically some of the flags first such as the repeat
logic or major/minor accounting of course.

For that to work, I'm adding a VM_FAULT_ACCESS (that gets OR'ed with
VM_FAULT_SIGSEGV) to differentiate SEGV_MAPERR and SEGV_ACCERR. So far
so good.

However, I noticed a small discrepancy on x86 in the handling of fatal
signals:

I see two path that can be hit on a fatal signal. The "obvious"
one is the one in access_error() which calls no_context() with a 0
signal argument, the other path is in the retry handling, which will in
this case call no_context() with SIGBUS/BUS_ADRERR. 

Now, the only place in there that seems to care about the signal that
gets passed in is the sig_on_uaccess_error case. On one case (0 sig),
that test will be skipped, on the other case (SIGBUS), that test will be
done and might result in a sigbus being generated, which might override
the original deadly signal (or am I missing something ?)

Now I don't completely understand how the x86 vsyscall stuff works so I
don't know precisely in what circumstances that test matters, I'll need
you help there.

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-02-28 21:49                             ` Linus Torvalds
@ 2015-02-28 22:49                               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 22:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, 2015-02-28 at 13:49 -0800, Linus Torvalds wrote:

 .../...

>  - we handle write faults separately (see the first part of access_error()
> 
>  - so now we know it was a read or an instruction fetch
> 
>  - if PF_PROT is set, that means that the present bit was set in the
> page tables, so it must have been an exec access to a NX page
> 
>  - otherwise, we just say "PROTNONE means no access, otherwise
> populate the page tables"
> 
> .. and if it turns out that it was a PF_INSTR to a NX page, we'll end
> up taking the page fault *again* after it's been populated, and now
> since the page table was populated, the access_error() will catch it
> with the PF_PROT case.
> 
> Or something like that. I might have screwed up some detail, but it
> should all work.

I see, it should work yes, I'll still add that FAULT_FLAG_EXEC for
those who can tell reliably but it shouldn't hurt for x86 to not set it.

Cheers,
Ben.


>                      Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-arch" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-02-28 22:49                               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 22:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, 2015-02-28 at 13:49 -0800, Linus Torvalds wrote:

 .../...

>  - we handle write faults separately (see the first part of access_error()
> 
>  - so now we know it was a read or an instruction fetch
> 
>  - if PF_PROT is set, that means that the present bit was set in the
> page tables, so it must have been an exec access to a NX page
> 
>  - otherwise, we just say "PROTNONE means no access, otherwise
> populate the page tables"
> 
> .. and if it turns out that it was a PF_INSTR to a NX page, we'll end
> up taking the page fault *again* after it's been populated, and now
> since the page table was populated, the access_error() will catch it
> with the PF_PROT case.
> 
> Or something like that. I might have screwed up some detail, but it
> should all work.

I see, it should work yes, I'll still add that FAULT_FLAG_EXEC for
those who can tell reliably but it shouldn't hurt for x86 to not set it.

Cheers,
Ben.


>                      Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-arch" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-02-28 22:16                             ` Benjamin Herrenschmidt
@ 2015-02-28 22:50                               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 22:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sun, 2015-03-01 at 09:16 +1100, Benjamin Herrenschmidt wrote:
> So for error handling, I'm trying to simply return the VM_FAULT_* flags
> from generic_page_fault see where that takes us. That's a way to avoid
> passing an arch specific struct around. It also allows my hack to
> account major faults with the hypervisor to be done outside the generic
> code completely (no hook).

I suspect sparc64 will defeat the "not passing an arch struct around"
though... oh well.

Ben.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-02-28 22:50                               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 22:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sun, 2015-03-01 at 09:16 +1100, Benjamin Herrenschmidt wrote:
> So for error handling, I'm trying to simply return the VM_FAULT_* flags
> from generic_page_fault see where that takes us. That's a way to avoid
> passing an arch specific struct around. It also allows my hack to
> account major faults with the hypervisor to be done outside the generic
> code completely (no hook).

I suspect sparc64 will defeat the "not passing an arch struct around"
though... oh well.

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-02-28 22:16                             ` Benjamin Herrenschmidt
@ 2015-02-28 23:02                               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 23:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sun, 2015-03-01 at 09:16 +1100, Benjamin Herrenschmidt wrote:
> So for error handling, I'm trying to simply return the VM_FAULT_* flags
> from generic_page_fault see where that takes us. That's a way to avoid
> passing an arch specific struct around. It also allows my hack to
> account major faults with the hypervisor to be done outside the generic
> code completely (no hook).
>
> .../...

Here's what it looks like for x86 only and without completely sorting
out the fatal signal business. However I might still have to do the
arch pointer you mentioned for sparc and possibly other archs, but so
far it looks better already.

Note that if I add that arch pointer, I might stop messing around
or even returning "fault" and instead just return a simple enum
minor,major,error and let inline arch hooks populate the arch pointer
with the error details in whatever fashion the arch prefers. However
I suspect they'll all end up with sig and si_code in there...

Anyway, here's the current patch:

 arch/x86/include/asm/fault.h |  21 ++++
 arch/x86/mm/fault.c          | 233 ++++---------------------------------------
 include/linux/fault.h        |  24 +++++
 include/linux/mm.h           |   5 +-
 mm/Makefile                  |   2 +-
 mm/fault.c                   | 196 ++++++++++++++++++++++++++++++++++++
 6 files changed, 266 insertions(+), 215 deletions(-)
 create mode 100644 arch/x86/include/asm/fault.h
 create mode 100644 include/linux/fault.h
 create mode 100644 mm/fault.c

diff --git a/arch/x86/include/asm/fault.h b/arch/x86/include/asm/fault.h
new file mode 100644
index 0000000..04263ec
--- /dev/null
+++ b/arch/x86/include/asm/fault.h
@@ -0,0 +1,21 @@
+#ifndef _ASM_X86_FAULT_H
+#define _ASM_X86_FAULT_H
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+/* Check if the stack is allowed to grow during a user page fault */
+static inline bool stack_can_grow(struct pt_regs *regs, unsigned long flags,
+				  unsigned long address,
+				  struct vm_area_struct *vma)
+{
+	/*
+	 * Accessing the stack below %sp is always a bug.
+	 * The large cushion allows instructions like enter
+	 * and pusha to work. ("enter $65535, $31" pushes
+	 * 32 pointers and then decrements %sp by 65535.)
+	 */
+	return address + 65536 + 32 * sizeof(unsigned long) >= regs->sp;
+}
+
+#endif /*  _ASM_X86_FAULT_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ede025f..19a8a91 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -13,6 +13,7 @@
 #include <linux/hugetlb.h>		/* hstate_index_to_shift	*/
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
+#include <linux/fault.h>
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -748,8 +749,7 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 	printk(KERN_CONT "\n");
 }
 
-static void
-__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
+static void __bad_area(struct pt_regs *regs, unsigned long error_code,
 		       unsigned long address, int si_code)
 {
 	struct task_struct *tsk = current;
@@ -804,39 +804,10 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 	no_context(regs, error_code, address, SIGSEGV, si_code);
 }
 
-static noinline void
-bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		     unsigned long address)
-{
-	__bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
-}
-
-static void
-__bad_area(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int si_code)
-{
-	struct mm_struct *mm = current->mm;
-
-	/*
-	 * Something tried to access memory that isn't in our memory map..
-	 * Fix it, but check if it's kernel or user first..
-	 */
-	up_read(&mm->mmap_sem);
-
-	__bad_area_nosemaphore(regs, error_code, address, si_code);
-}
-
-static noinline void
-bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
-{
-	__bad_area(regs, error_code, address, SEGV_MAPERR);
-}
-
-static noinline void
-bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
-		      unsigned long address)
+static inline void bad_area(struct pt_regs *regs, unsigned long error_code,
+			    unsigned long address, int si_code)
 {
-	__bad_area(regs, error_code, address, SEGV_ACCERR);
+	__bad_area(regs, error_code, address, si_code);
 }
 
 static void
@@ -871,40 +842,6 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
 }
 
-static noinline void
-mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-	       unsigned long address, unsigned int fault)
-{
-	if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
-		no_context(regs, error_code, address, 0, 0);
-		return;
-	}
-
-	if (fault & VM_FAULT_OOM) {
-		/* Kernel mode? Handle exceptions or die: */
-		if (!(error_code & PF_USER)) {
-			no_context(regs, error_code, address,
-				   SIGSEGV, SEGV_MAPERR);
-			return;
-		}
-
-		/*
-		 * We ran out of memory, call the OOM killer, and return the
-		 * userspace (which will retry the fault, or kill us if we got
-		 * oom-killed):
-		 */
-		pagefault_out_of_memory();
-	} else {
-		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
-			     VM_FAULT_HWPOISON_LARGE))
-			do_sigbus(regs, error_code, address, fault);
-		else if (fault & VM_FAULT_SIGSEGV)
-			bad_area_nosemaphore(regs, error_code, address);
-		else
-			BUG();
-	}
-}
-
 static int spurious_fault_check(unsigned long error_code, pte_t *pte)
 {
 	if ((error_code & PF_WRITE) && !pte_write(*pte))
@@ -998,27 +935,6 @@ NOKPROBE_SYMBOL(spurious_fault);
 
 int show_unhandled_signals = 1;
 
-static inline int
-access_error(unsigned long error_code, struct vm_area_struct *vma)
-{
-	if (error_code & PF_WRITE) {
-		/* write, present and write, not present: */
-		if (unlikely(!(vma->vm_flags & VM_WRITE)))
-			return 1;
-		return 0;
-	}
-
-	/* read, present: */
-	if (unlikely(error_code & PF_PROT))
-		return 1;
-
-	/* read, not present: */
-	if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
-		return 1;
-
-	return 0;
-}
-
 static int fault_in_kernel_space(unsigned long address)
 {
 	return address >= TASK_SIZE_MAX;
@@ -1054,11 +970,10 @@ static noinline void
 __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		unsigned long address)
 {
-	struct vm_area_struct *vma;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
-	int fault, major = 0;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int fault;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1107,7 +1022,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		 * Don't take the mm semaphore here. If we fixup a prefetch
 		 * fault we could otherwise deadlock:
 		 */
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area(regs, error_code, address, SEGV_MAPERR);
 
 		return;
 	}
@@ -1120,7 +1035,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		pgtable_bad(regs, error_code, address);
 
 	if (unlikely(smap_violation(error_code, regs))) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area(regs, error_code, address, SEGV_MAPERR);
 		return;
 	}
 
@@ -1129,13 +1044,14 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * in an atomic region then we must not take the fault:
 	 */
 	if (unlikely(in_atomic() || !mm)) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area(regs, error_code, address, SEGV_MAPERR);
 		return;
 	}
 
 	/*
 	 * It's safe to allow irq's after cr2 has been saved and the
 	 * vmalloc fault has been handled.
+
 	 *
 	 * User-mode registers count as a user access even for any
 	 * potential system fault or CPU buglet:
@@ -1143,138 +1059,29 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (user_mode_vm(regs)) {
 		local_irq_enable();
 		error_code |= PF_USER;
-		flags |= FAULT_FLAG_USER;
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
 	}
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
+	if (error_code & PF_USER)
+		flags |= FAULT_FLAG_USER;
 	if (error_code & PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
+	if (error_code & PF_PROT)
+		flags |= FAULT_FLAG_PROT;
 
-	/*
-	 * When running in the kernel we expect faults to occur only to
-	 * addresses in user space.  All other faults represent errors in
-	 * the kernel and should generate an OOPS.  Unfortunately, in the
-	 * case of an erroneous fault occurring in a code path which already
-	 * holds mmap_sem we will deadlock attempting to validate the fault
-	 * against the address space.  Luckily the kernel only validly
-	 * references user space from well defined areas of code, which are
-	 * listed in the exceptions table.
-	 *
-	 * As the vast majority of faults will be valid we will only perform
-	 * the source reference check when there is a possibility of a
-	 * deadlock. Attempt to lock the address space, if we cannot we then
-	 * validate the source. If this is invalid we can skip the address
-	 * space check, thus avoiding the deadlock:
-	 */
-	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
-		if ((error_code & PF_USER) == 0 &&
-		    !search_exception_tables(regs->ip)) {
-			bad_area_nosemaphore(regs, error_code, address);
-			return;
-		}
-retry:
-		down_read(&mm->mmap_sem);
-	} else {
-		/*
-		 * The above down_read_trylock() might have succeeded in
-		 * which case we'll have missed the might_sleep() from
-		 * down_read():
-		 */
-		might_sleep();
-	}
-
-	vma = find_vma(mm, address);
-	if (unlikely(!vma)) {
-		bad_area(regs, error_code, address);
-		return;
-	}
-	if (likely(vma->vm_start <= address))
-		goto good_area;
-	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		bad_area(regs, error_code, address);
+	fault = generic_page_fault(regs, tsk, flags, address);
+	if (unlikely(fault & VM_FAULT_SIGSEGV)) {
+		bad_area(regs, error_code, address,
+			 (fault & VM_FAULT_ACCESS) ? SEGV_MAPERR : SEGV_ACCERR);
 		return;
 	}
-	if (error_code & PF_USER) {
-		/*
-		 * Accessing the stack below %sp is always a bug.
-		 * The large cushion allows instructions like enter
-		 * and pusha to work. ("enter $65535, $31" pushes
-		 * 32 pointers and then decrements %sp by 65535.)
-		 */
-		if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
-			bad_area(regs, error_code, address);
-			return;
-		}
-	}
-	if (unlikely(expand_stack(vma, address))) {
-		bad_area(regs, error_code, address);
+	if (unlikely(fault & VM_FAULT_SIGBUS)) {
+		do_sigbus(regs, error_code, address, fault);
 		return;
 	}
 
-	/*
-	 * Ok, we have a good vm_area for this memory access, so
-	 * we can handle it..
-	 */
-good_area:
-	if (unlikely(access_error(error_code, vma))) {
-		bad_area_access_error(regs, error_code, address);
-		return;
-	}
-
-	/*
-	 * If for any reason at all we couldn't handle the fault,
-	 * make sure we exit gracefully rather than endlessly redo
-	 * the fault.  Since we never set FAULT_FLAG_RETRY_NOWAIT, if
-	 * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
-	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
-	major |= fault & VM_FAULT_MAJOR;
-
-	/*
-	 * If we need to retry the mmap_sem has already been released,
-	 * and if there is a fatal signal pending there is no guarantee
-	 * that we made any progress. Handle this case first.
-	 */
-	if (unlikely(fault & VM_FAULT_RETRY)) {
-		/* Retry at most once */
-		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-			flags |= FAULT_FLAG_TRIED;
-			if (!fatal_signal_pending(tsk))
-				goto retry;
-		}
-
-		/* User mode? Just return to handle the fatal exception */
-		if (flags & FAULT_FLAG_USER)
-			return;
-
-		/* Not returning to user mode? Handle exceptions or die: */
-		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
-		return;
-	}
-
-	up_read(&mm->mmap_sem);
-	if (unlikely(fault & VM_FAULT_ERROR)) {
-		mm_fault_error(regs, error_code, address, fault);
-		return;
-	}
-
-	/*
-	 * Major/minor page fault accounting. If any of the events
-	 * returned VM_FAULT_MAJOR, we account it as a major fault.
-	 */
-	if (major) {
-		tsk->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
-	} else {
-		tsk->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
-	}
-
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(__do_page_fault);
diff --git a/include/linux/fault.h b/include/linux/fault.h
new file mode 100644
index 0000000..11c567e
--- /dev/null
+++ b/include/linux/fault.h
@@ -0,0 +1,24 @@
+#ifndef __FAULT_H
+#define __FAULT_H
+
+/* Generic page fault stuff */
+
+#include <asm/fault.h>
+
+/* Returns the fault flags from handle_mm_fault() with the addition
+ * that:
+ *
+ * - On an error, either VM_FAULT_SIGSEGV or VM_FAULT_SIGBUS will
+ *   always be set in addition to other flags
+ *
+ * - VM_FAULT_ACCESS will be added to VM_FAULT_SIGSEGV for access
+ *   protection faults
+ *
+ * - VM_FAULT_MAJOR will be set even if there was a retry
+ *
+ * - XXX FIXME: VM_FAULT_SIGBUS will be set on a fatal signal
+ */
+unsigned int generic_page_fault(struct pt_regs *regs, struct task_struct *tsk,
+				unsigned long flags, unsigned long address);
+
+#endif /* __FAULT_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a9392..5578eba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -211,6 +211,9 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x20	/* Second try */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
+#define FAULT_FLAG_EXEC		0x80	/* The fault was an instruction fetch */
+#define FAULT_FLAG_PROT		0x100	/* HW detected protection fault */
+#define FAULT_FLAG_NO_STK_GROW	0x200	/* Fault is not allowed to grow stack */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -1098,7 +1101,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned small page */
 #define VM_FAULT_HWPOISON_LARGE 0x0020  /* Hit poisoned large page. Index encoded in upper bits */
 #define VM_FAULT_SIGSEGV 0x0040
-
+#define VM_FAULT_ACCESS  0x0080 /* in addition to VM_FAULT_SIGSEGV */
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
diff --git a/mm/Makefile b/mm/Makefile
index 3c1caa2..f647ff1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -8,7 +8,7 @@ KASAN_SANITIZE_slub.o := n
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= gup.o highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o pagewalk.o pgtable-generic.o
+			   vmalloc.o pagewalk.o pgtable-generic.o fault.o
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
diff --git a/mm/fault.c b/mm/fault.c
new file mode 100644
index 0000000..3bf4583
--- /dev/null
+++ b/mm/fault.c
@@ -0,0 +1,196 @@
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/perf_event.h>
+#include <linux/module.h>
+
+#include <asm/fault.h>
+
+static noinline unsigned int mm_fault_error(struct pt_regs *regs,
+					    unsigned int flags,
+					    unsigned long address,
+					    unsigned int fault)
+{
+	/* XXX Hack in VM_FAULT_SIGBUS, we need to fix that */
+	if (fatal_signal_pending(current) && !(flags & FAULT_FLAG_USER))
+		return fault | VM_FAULT_SIGBUS;
+
+	if (fault & VM_FAULT_OOM) {
+		/* Kernel mode? Handle exceptions or die: */
+		if (!(flags & FAULT_FLAG_USER))
+			return fault | VM_FAULT_SIGSEGV;
+
+		/*
+		 * We ran out of memory, call the OOM killer, and return the
+		 * userspace (which will retry the fault, or kill us if we got
+		 * oom-killed):
+		 */
+		pagefault_out_of_memory();
+	} else {
+		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
+			     VM_FAULT_HWPOISON_LARGE))
+			return fault | VM_FAULT_SIGBUS;
+		else if (fault & VM_FAULT_SIGSEGV)
+			return fault;
+		else
+			BUG();
+	}
+
+	/* Clear error conditions */
+	return fault & ~VM_FAULT_ERROR;
+}
+
+/* Access validity check */
+static inline bool access_error(struct pt_regs *regs, unsigned long flags,
+				struct vm_area_struct *vma)
+{
+	/* Write fault to a non-writeable VMA */
+	if (flags & FAULT_FLAG_WRITE) {
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			return true;
+		return false;
+	}
+
+	/* Exec fault to a non-executable VMA */
+	if (flags & FAULT_FLAG_EXEC) {
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			return true;
+		return false;
+	}
+
+	/* Other HW detected protection fault */
+	if (unlikely(flags & PF_PROT))
+		return true;
+
+	/* No access allowed to that VMA */
+	if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+		return true;
+
+	return false;
+}
+
+unsigned int generic_page_fault(struct pt_regs *regs, struct task_struct *tsk,
+				unsigned int flags, unsigned long address)
+{
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int fault, major = 0;
+
+	mm = tsk->mm;
+
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
+	/*
+	 * When running in the kernel we expect faults to occur only to
+	 * addresses in user space.  All other faults represent errors in
+	 * the kernel and should generate an OOPS.  Unfortunately, in the
+	 * case of an erroneous fault occurring in a code path which already
+	 * holds mmap_sem we will deadlock attempting to validate the fault
+	 * against the address space.  Luckily the kernel only validly
+	 * references user space from well defined areas of code, which are
+	 * listed in the exceptions table.
+	 *
+	 * As the vast majority of faults will be valid we will only perform
+	 * the source reference check when there is a possibility of a
+	 * deadlock. Attempt to lock the address space, if we cannot we then
+	 * validate the source. If this is invalid we can skip the address
+	 * space check, thus avoiding the deadlock:
+	 */
+	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
+		if (!(flags & FAULT_FLAG_USER) &&
+		    !search_exception_tables(GET_IP(regs))) {
+			return VM_FAULT_SIGSEGV;
+		}
+retry:
+		down_read(&mm->mmap_sem);
+	} else {
+		/*
+		 * The above down_read_trylock() might have succeeded in
+		 * which case we'll have missed the might_sleep() from
+		 * down_read():
+		 */
+		might_sleep();
+	}
+
+	vma = find_vma(mm, address);
+	if (unlikely(!vma))
+		goto bad_area;
+	if (likely(vma->vm_start <= address))
+		goto good_area;
+	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
+		goto bad_area;
+	if (unlikely((flags & FAULT_FLAG_USER) &&
+		     !stack_can_grow(regs, flags, address, vma)))
+		goto bad_area;
+	if (unlikely(expand_stack(vma, address)))
+		goto bad_area;
+
+	/*
+	 * Ok, we have a good vm_area for this memory access, so
+	 * we can handle it..
+	 */
+good_area:
+	if (unlikely(access_error(regs, flags, vma)))
+		goto bad_access;
+
+	/*
+	 * If for any reason at all we couldn't handle the fault,
+	 * make sure we exit gracefully rather than endlessly redo
+	 * the fault.  Since we never set FAULT_FLAG_RETRY_NOWAIT, if
+	 * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
+	 */
+	fault = handle_mm_fault(mm, vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
+
+	/*
+	 * If we need to retry the mmap_sem has already been released,
+	 * and if there is a fatal signal pending there is no guarantee
+	 * that we made any progress. Handle this case first.
+	 */
+	if (unlikely(fault & VM_FAULT_RETRY)) {
+		/* Retry at most once */
+		if (flags & FAULT_FLAG_ALLOW_RETRY) {
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			flags |= FAULT_FLAG_TRIED;
+			if (!fatal_signal_pending(tsk))
+				goto retry;
+		}
+
+		/* User mode? Just return to handle the fatal exception */
+		if (flags & FAULT_FLAG_USER)
+			return FAULT_NO_ERR;
+
+		/* Not returning to user mode? Handle exceptions or die: */
+		/* XXX mimmic x86, but might not be best */
+		return fault | VM_FAULT_SIGBUS;
+	}
+
+	up_read(&mm->mmap_sem);
+	if (unlikely(fault & VM_FAULT_ERROR))
+		return mm_fault_error(regs, flags, address, fault);
+
+	/*
+	 * Major/minor page fault accounting. If any of the events
+	 * returned VM_FAULT_MAJOR, we account it as a major fault.
+	 */
+	if (major) {
+		tsk->maj_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
+
+		/* Restore the major flag, it might have been lost in case of
+		 * retry and the arch might care
+		 */
+		if (major)
+			fault |= VM_FAULT_MAJOR;
+	} else {
+		tsk->min_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
+	}
+	return fault;
+
+ bad_area:
+	up_read(&mm->mmap_sem);
+	return VM_FAULT_SIGSEGV;
+ bad_access:
+	up_read(&mm->mmap_sem);
+	return VM_FAULT_SIGSEGV | VM_FAULT_ACCESS;
+}



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-02-28 23:02                               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-28 23:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sun, 2015-03-01 at 09:16 +1100, Benjamin Herrenschmidt wrote:
> So for error handling, I'm trying to simply return the VM_FAULT_* flags
> from generic_page_fault see where that takes us. That's a way to avoid
> passing an arch specific struct around. It also allows my hack to
> account major faults with the hypervisor to be done outside the generic
> code completely (no hook).
>
> .../...

Here's what it looks like for x86 only and without completely sorting
out the fatal signal business. However I might still have to do the
arch pointer you mentioned for sparc and possibly other archs, but so
far it looks better already.

Note that if I add that arch pointer, I might stop messing around
or even returning "fault" and instead just return a simple enum
minor,major,error and let inline arch hooks populate the arch pointer
with the error details in whatever fashion the arch prefers. However
I suspect they'll all end up with sig and si_code in there...

Anyway, here's the current patch:

 arch/x86/include/asm/fault.h |  21 ++++
 arch/x86/mm/fault.c          | 233 ++++---------------------------------------
 include/linux/fault.h        |  24 +++++
 include/linux/mm.h           |   5 +-
 mm/Makefile                  |   2 +-
 mm/fault.c                   | 196 ++++++++++++++++++++++++++++++++++++
 6 files changed, 266 insertions(+), 215 deletions(-)
 create mode 100644 arch/x86/include/asm/fault.h
 create mode 100644 include/linux/fault.h
 create mode 100644 mm/fault.c

diff --git a/arch/x86/include/asm/fault.h b/arch/x86/include/asm/fault.h
new file mode 100644
index 0000000..04263ec
--- /dev/null
+++ b/arch/x86/include/asm/fault.h
@@ -0,0 +1,21 @@
+#ifndef _ASM_X86_FAULT_H
+#define _ASM_X86_FAULT_H
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+/* Check if the stack is allowed to grow during a user page fault */
+static inline bool stack_can_grow(struct pt_regs *regs, unsigned long flags,
+				  unsigned long address,
+				  struct vm_area_struct *vma)
+{
+	/*
+	 * Accessing the stack below %sp is always a bug.
+	 * The large cushion allows instructions like enter
+	 * and pusha to work. ("enter $65535, $31" pushes
+	 * 32 pointers and then decrements %sp by 65535.)
+	 */
+	return address + 65536 + 32 * sizeof(unsigned long) >= regs->sp;
+}
+
+#endif /*  _ASM_X86_FAULT_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ede025f..19a8a91 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -13,6 +13,7 @@
 #include <linux/hugetlb.h>		/* hstate_index_to_shift	*/
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
+#include <linux/fault.h>
 
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
@@ -748,8 +749,7 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 	printk(KERN_CONT "\n");
 }
 
-static void
-__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
+static void __bad_area(struct pt_regs *regs, unsigned long error_code,
 		       unsigned long address, int si_code)
 {
 	struct task_struct *tsk = current;
@@ -804,39 +804,10 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 	no_context(regs, error_code, address, SIGSEGV, si_code);
 }
 
-static noinline void
-bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		     unsigned long address)
-{
-	__bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
-}
-
-static void
-__bad_area(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int si_code)
-{
-	struct mm_struct *mm = current->mm;
-
-	/*
-	 * Something tried to access memory that isn't in our memory map..
-	 * Fix it, but check if it's kernel or user first..
-	 */
-	up_read(&mm->mmap_sem);
-
-	__bad_area_nosemaphore(regs, error_code, address, si_code);
-}
-
-static noinline void
-bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
-{
-	__bad_area(regs, error_code, address, SEGV_MAPERR);
-}
-
-static noinline void
-bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
-		      unsigned long address)
+static inline void bad_area(struct pt_regs *regs, unsigned long error_code,
+			    unsigned long address, int si_code)
 {
-	__bad_area(regs, error_code, address, SEGV_ACCERR);
+	__bad_area(regs, error_code, address, si_code);
 }
 
 static void
@@ -871,40 +842,6 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
 }
 
-static noinline void
-mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-	       unsigned long address, unsigned int fault)
-{
-	if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
-		no_context(regs, error_code, address, 0, 0);
-		return;
-	}
-
-	if (fault & VM_FAULT_OOM) {
-		/* Kernel mode? Handle exceptions or die: */
-		if (!(error_code & PF_USER)) {
-			no_context(regs, error_code, address,
-				   SIGSEGV, SEGV_MAPERR);
-			return;
-		}
-
-		/*
-		 * We ran out of memory, call the OOM killer, and return the
-		 * userspace (which will retry the fault, or kill us if we got
-		 * oom-killed):
-		 */
-		pagefault_out_of_memory();
-	} else {
-		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
-			     VM_FAULT_HWPOISON_LARGE))
-			do_sigbus(regs, error_code, address, fault);
-		else if (fault & VM_FAULT_SIGSEGV)
-			bad_area_nosemaphore(regs, error_code, address);
-		else
-			BUG();
-	}
-}
-
 static int spurious_fault_check(unsigned long error_code, pte_t *pte)
 {
 	if ((error_code & PF_WRITE) && !pte_write(*pte))
@@ -998,27 +935,6 @@ NOKPROBE_SYMBOL(spurious_fault);
 
 int show_unhandled_signals = 1;
 
-static inline int
-access_error(unsigned long error_code, struct vm_area_struct *vma)
-{
-	if (error_code & PF_WRITE) {
-		/* write, present and write, not present: */
-		if (unlikely(!(vma->vm_flags & VM_WRITE)))
-			return 1;
-		return 0;
-	}
-
-	/* read, present: */
-	if (unlikely(error_code & PF_PROT))
-		return 1;
-
-	/* read, not present: */
-	if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
-		return 1;
-
-	return 0;
-}
-
 static int fault_in_kernel_space(unsigned long address)
 {
 	return address >= TASK_SIZE_MAX;
@@ -1054,11 +970,10 @@ static noinline void
 __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		unsigned long address)
 {
-	struct vm_area_struct *vma;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
-	int fault, major = 0;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int fault;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1107,7 +1022,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		 * Don't take the mm semaphore here. If we fixup a prefetch
 		 * fault we could otherwise deadlock:
 		 */
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area(regs, error_code, address, SEGV_MAPERR);
 
 		return;
 	}
@@ -1120,7 +1035,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		pgtable_bad(regs, error_code, address);
 
 	if (unlikely(smap_violation(error_code, regs))) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area(regs, error_code, address, SEGV_MAPERR);
 		return;
 	}
 
@@ -1129,13 +1044,14 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * in an atomic region then we must not take the fault:
 	 */
 	if (unlikely(in_atomic() || !mm)) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area(regs, error_code, address, SEGV_MAPERR);
 		return;
 	}
 
 	/*
 	 * It's safe to allow irq's after cr2 has been saved and the
 	 * vmalloc fault has been handled.
+
 	 *
 	 * User-mode registers count as a user access even for any
 	 * potential system fault or CPU buglet:
@@ -1143,138 +1059,29 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (user_mode_vm(regs)) {
 		local_irq_enable();
 		error_code |= PF_USER;
-		flags |= FAULT_FLAG_USER;
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
 	}
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
+	if (error_code & PF_USER)
+		flags |= FAULT_FLAG_USER;
 	if (error_code & PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
+	if (error_code & PF_PROT)
+		flags |= FAULT_FLAG_PROT;
 
-	/*
-	 * When running in the kernel we expect faults to occur only to
-	 * addresses in user space.  All other faults represent errors in
-	 * the kernel and should generate an OOPS.  Unfortunately, in the
-	 * case of an erroneous fault occurring in a code path which already
-	 * holds mmap_sem we will deadlock attempting to validate the fault
-	 * against the address space.  Luckily the kernel only validly
-	 * references user space from well defined areas of code, which are
-	 * listed in the exceptions table.
-	 *
-	 * As the vast majority of faults will be valid we will only perform
-	 * the source reference check when there is a possibility of a
-	 * deadlock. Attempt to lock the address space, if we cannot we then
-	 * validate the source. If this is invalid we can skip the address
-	 * space check, thus avoiding the deadlock:
-	 */
-	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
-		if ((error_code & PF_USER) == 0 &&
-		    !search_exception_tables(regs->ip)) {
-			bad_area_nosemaphore(regs, error_code, address);
-			return;
-		}
-retry:
-		down_read(&mm->mmap_sem);
-	} else {
-		/*
-		 * The above down_read_trylock() might have succeeded in
-		 * which case we'll have missed the might_sleep() from
-		 * down_read():
-		 */
-		might_sleep();
-	}
-
-	vma = find_vma(mm, address);
-	if (unlikely(!vma)) {
-		bad_area(regs, error_code, address);
-		return;
-	}
-	if (likely(vma->vm_start <= address))
-		goto good_area;
-	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		bad_area(regs, error_code, address);
+	fault = generic_page_fault(regs, tsk, flags, address);
+	if (unlikely(fault & VM_FAULT_SIGSEGV)) {
+		bad_area(regs, error_code, address,
+			 (fault & VM_FAULT_ACCESS) ? SEGV_MAPERR : SEGV_ACCERR);
 		return;
 	}
-	if (error_code & PF_USER) {
-		/*
-		 * Accessing the stack below %sp is always a bug.
-		 * The large cushion allows instructions like enter
-		 * and pusha to work. ("enter $65535, $31" pushes
-		 * 32 pointers and then decrements %sp by 65535.)
-		 */
-		if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
-			bad_area(regs, error_code, address);
-			return;
-		}
-	}
-	if (unlikely(expand_stack(vma, address))) {
-		bad_area(regs, error_code, address);
+	if (unlikely(fault & VM_FAULT_SIGBUS)) {
+		do_sigbus(regs, error_code, address, fault);
 		return;
 	}
 
-	/*
-	 * Ok, we have a good vm_area for this memory access, so
-	 * we can handle it..
-	 */
-good_area:
-	if (unlikely(access_error(error_code, vma))) {
-		bad_area_access_error(regs, error_code, address);
-		return;
-	}
-
-	/*
-	 * If for any reason at all we couldn't handle the fault,
-	 * make sure we exit gracefully rather than endlessly redo
-	 * the fault.  Since we never set FAULT_FLAG_RETRY_NOWAIT, if
-	 * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
-	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
-	major |= fault & VM_FAULT_MAJOR;
-
-	/*
-	 * If we need to retry the mmap_sem has already been released,
-	 * and if there is a fatal signal pending there is no guarantee
-	 * that we made any progress. Handle this case first.
-	 */
-	if (unlikely(fault & VM_FAULT_RETRY)) {
-		/* Retry at most once */
-		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-			flags |= FAULT_FLAG_TRIED;
-			if (!fatal_signal_pending(tsk))
-				goto retry;
-		}
-
-		/* User mode? Just return to handle the fatal exception */
-		if (flags & FAULT_FLAG_USER)
-			return;
-
-		/* Not returning to user mode? Handle exceptions or die: */
-		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
-		return;
-	}
-
-	up_read(&mm->mmap_sem);
-	if (unlikely(fault & VM_FAULT_ERROR)) {
-		mm_fault_error(regs, error_code, address, fault);
-		return;
-	}
-
-	/*
-	 * Major/minor page fault accounting. If any of the events
-	 * returned VM_FAULT_MAJOR, we account it as a major fault.
-	 */
-	if (major) {
-		tsk->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
-	} else {
-		tsk->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
-	}
-
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(__do_page_fault);
diff --git a/include/linux/fault.h b/include/linux/fault.h
new file mode 100644
index 0000000..11c567e
--- /dev/null
+++ b/include/linux/fault.h
@@ -0,0 +1,24 @@
+#ifndef __FAULT_H
+#define __FAULT_H
+
+/* Generic page fault stuff */
+
+#include <asm/fault.h>
+
+/* Returns the fault flags from handle_mm_fault() with the addition
+ * that:
+ *
+ * - On an error, either VM_FAULT_SIGSEGV or VM_FAULT_SIGBUS will
+ *   always be set in addition to other flags
+ *
+ * - VM_FAULT_ACCESS will be added to VM_FAULT_SIGSEGV for access
+ *   protection faults
+ *
+ * - VM_FAULT_MAJOR will be set even if there was a retry
+ *
+ * - XXX FIXME: VM_FAULT_SIGBUS will be set on a fatal signal
+ */
+unsigned int generic_page_fault(struct pt_regs *regs, struct task_struct *tsk,
+				unsigned long flags, unsigned long address);
+
+#endif /* __FAULT_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a9392..5578eba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -211,6 +211,9 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x20	/* Second try */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
+#define FAULT_FLAG_EXEC		0x80	/* The fault was an instruction fetch */
+#define FAULT_FLAG_PROT		0x100	/* HW detected protection fault */
+#define FAULT_FLAG_NO_STK_GROW	0x200	/* Fault is not allowed to grow stack */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -1098,7 +1101,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned small page */
 #define VM_FAULT_HWPOISON_LARGE 0x0020  /* Hit poisoned large page. Index encoded in upper bits */
 #define VM_FAULT_SIGSEGV 0x0040
-
+#define VM_FAULT_ACCESS  0x0080 /* in addition to VM_FAULT_SIGSEGV */
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
diff --git a/mm/Makefile b/mm/Makefile
index 3c1caa2..f647ff1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -8,7 +8,7 @@ KASAN_SANITIZE_slub.o := n
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= gup.o highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o pagewalk.o pgtable-generic.o
+			   vmalloc.o pagewalk.o pgtable-generic.o fault.o
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
diff --git a/mm/fault.c b/mm/fault.c
new file mode 100644
index 0000000..3bf4583
--- /dev/null
+++ b/mm/fault.c
@@ -0,0 +1,196 @@
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/perf_event.h>
+#include <linux/module.h>
+
+#include <asm/fault.h>
+
+static noinline unsigned int mm_fault_error(struct pt_regs *regs,
+					    unsigned int flags,
+					    unsigned long address,
+					    unsigned int fault)
+{
+	/* XXX Hack in VM_FAULT_SIGBUS, we need to fix that */
+	if (fatal_signal_pending(current) && !(flags & FAULT_FLAG_USER))
+		return fault | VM_FAULT_SIGBUS;
+
+	if (fault & VM_FAULT_OOM) {
+		/* Kernel mode? Handle exceptions or die: */
+		if (!(flags & FAULT_FLAG_USER))
+			return fault | VM_FAULT_SIGSEGV;
+
+		/*
+		 * We ran out of memory, call the OOM killer, and return the
+		 * userspace (which will retry the fault, or kill us if we got
+		 * oom-killed):
+		 */
+		pagefault_out_of_memory();
+	} else {
+		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
+			     VM_FAULT_HWPOISON_LARGE))
+			return fault | VM_FAULT_SIGBUS;
+		else if (fault & VM_FAULT_SIGSEGV)
+			return fault;
+		else
+			BUG();
+	}
+
+	/* Clear error conditions */
+	return fault & ~VM_FAULT_ERROR;
+}
+
+/* Access validity check */
+static inline bool access_error(struct pt_regs *regs, unsigned long flags,
+				struct vm_area_struct *vma)
+{
+	/* Write fault to a non-writeable VMA */
+	if (flags & FAULT_FLAG_WRITE) {
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			return true;
+		return false;
+	}
+
+	/* Exec fault to a non-executable VMA */
+	if (flags & FAULT_FLAG_EXEC) {
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			return true;
+		return false;
+	}
+
+	/* Other HW detected protection fault */
+	if (unlikely(flags & PF_PROT))
+		return true;
+
+	/* No access allowed to that VMA */
+	if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+		return true;
+
+	return false;
+}
+
+unsigned int generic_page_fault(struct pt_regs *regs, struct task_struct *tsk,
+				unsigned int flags, unsigned long address)
+{
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int fault, major = 0;
+
+	mm = tsk->mm;
+
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
+	/*
+	 * When running in the kernel we expect faults to occur only to
+	 * addresses in user space.  All other faults represent errors in
+	 * the kernel and should generate an OOPS.  Unfortunately, in the
+	 * case of an erroneous fault occurring in a code path which already
+	 * holds mmap_sem we will deadlock attempting to validate the fault
+	 * against the address space.  Luckily the kernel only validly
+	 * references user space from well defined areas of code, which are
+	 * listed in the exceptions table.
+	 *
+	 * As the vast majority of faults will be valid we will only perform
+	 * the source reference check when there is a possibility of a
+	 * deadlock. Attempt to lock the address space, if we cannot we then
+	 * validate the source. If this is invalid we can skip the address
+	 * space check, thus avoiding the deadlock:
+	 */
+	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
+		if (!(flags & FAULT_FLAG_USER) &&
+		    !search_exception_tables(GET_IP(regs))) {
+			return VM_FAULT_SIGSEGV;
+		}
+retry:
+		down_read(&mm->mmap_sem);
+	} else {
+		/*
+		 * The above down_read_trylock() might have succeeded in
+		 * which case we'll have missed the might_sleep() from
+		 * down_read():
+		 */
+		might_sleep();
+	}
+
+	vma = find_vma(mm, address);
+	if (unlikely(!vma))
+		goto bad_area;
+	if (likely(vma->vm_start <= address))
+		goto good_area;
+	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
+		goto bad_area;
+	if (unlikely((flags & FAULT_FLAG_USER) &&
+		     !stack_can_grow(regs, flags, address, vma)))
+		goto bad_area;
+	if (unlikely(expand_stack(vma, address)))
+		goto bad_area;
+
+	/*
+	 * Ok, we have a good vm_area for this memory access, so
+	 * we can handle it..
+	 */
+good_area:
+	if (unlikely(access_error(regs, flags, vma)))
+		goto bad_access;
+
+	/*
+	 * If for any reason at all we couldn't handle the fault,
+	 * make sure we exit gracefully rather than endlessly redo
+	 * the fault.  Since we never set FAULT_FLAG_RETRY_NOWAIT, if
+	 * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
+	 */
+	fault = handle_mm_fault(mm, vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
+
+	/*
+	 * If we need to retry the mmap_sem has already been released,
+	 * and if there is a fatal signal pending there is no guarantee
+	 * that we made any progress. Handle this case first.
+	 */
+	if (unlikely(fault & VM_FAULT_RETRY)) {
+		/* Retry at most once */
+		if (flags & FAULT_FLAG_ALLOW_RETRY) {
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			flags |= FAULT_FLAG_TRIED;
+			if (!fatal_signal_pending(tsk))
+				goto retry;
+		}
+
+		/* User mode? Just return to handle the fatal exception */
+		if (flags & FAULT_FLAG_USER)
+			return FAULT_NO_ERR;
+
+		/* Not returning to user mode? Handle exceptions or die: */
+		/* XXX mimmic x86, but might not be best */
+		return fault | VM_FAULT_SIGBUS;
+	}
+
+	up_read(&mm->mmap_sem);
+	if (unlikely(fault & VM_FAULT_ERROR))
+		return mm_fault_error(regs, flags, address, fault);
+
+	/*
+	 * Major/minor page fault accounting. If any of the events
+	 * returned VM_FAULT_MAJOR, we account it as a major fault.
+	 */
+	if (major) {
+		tsk->maj_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
+
+		/* Restore the major flag, it might have been lost in case of
+		 * retry and the arch might care
+		 */
+		if (major)
+			fault |= VM_FAULT_MAJOR;
+	} else {
+		tsk->min_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
+	}
+	return fault;
+
+ bad_area:
+	up_read(&mm->mmap_sem);
+	return VM_FAULT_SIGSEGV;
+ bad_access:
+	up_read(&mm->mmap_sem);
+	return VM_FAULT_SIGSEGV | VM_FAULT_ACCESS;
+}


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-02-28 23:02                               ` Benjamin Herrenschmidt
@ 2015-03-01  0:41                                 ` Linus Torvalds
  -1 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-03-01  0:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, Feb 28, 2015 at 3:02 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> Anyway, here's the current patch:

Ok, I think I like this approach better.

Your FAULT_FLAG_EXEC handling is wrong, though. It shouldn't check
VM_WRITE, it should check VM_EXEC. A bit too much copy-paste ;)

Btw, it's quite possible that we could just do all the PF_PROT
handling at the x86 level, before even calling the generic fault
handler. It's not like we even need the vma or the mm semaphore: if
it's a non-write protection fault, we always SIGSEGV. So why even
bother getting the locks and looking up the page tables etc?

Now, that PF_PROT handling isn't exactly performance-critical, but it
might help to remove the odd x86 special case from the generic code.

                      Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-03-01  0:41                                 ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-03-01  0:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, Feb 28, 2015 at 3:02 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> Anyway, here's the current patch:

Ok, I think I like this approach better.

Your FAULT_FLAG_EXEC handling is wrong, though. It shouldn't check
VM_WRITE, it should check VM_EXEC. A bit too much copy-paste ;)

Btw, it's quite possible that we could just do all the PF_PROT
handling at the x86 level, before even calling the generic fault
handler. It's not like we even need the vma or the mm semaphore: if
it's a non-write protection fault, we always SIGSEGV. So why even
bother getting the locks and looking up the page tables etc?

Now, that PF_PROT handling isn't exactly performance-critical, but it
might help to remove the odd x86 special case from the generic code.

                      Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
  2015-03-01  0:41                                 ` Linus Torvalds
@ 2015-03-01  3:57                                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-01  3:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, 2015-02-28 at 16:41 -0800, Linus Torvalds wrote:
> On Sat, Feb 28, 2015 at 3:02 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > Anyway, here's the current patch:
> 
> Ok, I think I like this approach better.
> 
> Your FAULT_FLAG_EXEC handling is wrong, though. It shouldn't check
> VM_WRITE, it should check VM_EXEC. A bit too much copy-paste ;)

Ah yes indeed :) I would have caught that pretty quickly once I tried
powerpc.

> Btw, it's quite possible that we could just do all the PF_PROT
> handling at the x86 level, before even calling the generic fault
> handler. It's not like we even need the vma or the mm semaphore: if
> it's a non-write protection fault, we always SIGSEGV. So why even
> bother getting the locks and looking up the page tables etc?

For non-write possibly, though that means we always end up going
through handle_mm_fault for a non-present PTE before we decide
that it wasn't actually accessible.

Not a huge deal I suppose unless something (ab)uses mprotect + emulation
for things like spying on MMIO accesses (been there) or god knows what
other userspace based paging or GC might be out there. It's asking for
trouble but the check in access_error() doesn't hurt much either.

On the other hand, we are all about simplifying code here, aren't
we ? :-) It's not like the early bail-out on x86 will buy us much and
it's going to be more arch specific code since they'll all have a
different variant of PF_PROT.

One thing that worries me a bit, however, with the PF_PROT handling in
the generic case is the case of archs who support
present-but-not-accessible PTEs in "HW" (for us that basically means no
_PAGE_USER), as these might use such fault for NUMA balancing, or
software maintained "young" bit, and in those case, bailing out on
PF_PROT is wrong. I'll have to figure out if that ever happens when I
dig through more archs.

> Now, that PF_PROT handling isn't exactly performance-critical, but it
> might help to remove the odd x86 special case from the generic code.

I don't think it's that odd on x86. If you look at what other archs do,
there's a bit of everything out there. Just powerpc may or may not
support exec permission for example depending on the CPU generation and
when it doesn't it's basically similar to x86. I yet have to fully
understand what ARM does (looks simple, just need to make sure it's what
I think it is and check various ARM generations), sparc has its
oddities, etc...

None of that is terribly urgent and I'm pretty busy so give me a bit
more time now that we have some kind of direction, to go through a few
more archs and see where it takes us. I'll start gathering cross
compilers and qemu debian images in the next few days.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Generic page fault (Was: libsigsegv ....)
@ 2015-03-01  3:57                                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 43+ messages in thread
From: Benjamin Herrenschmidt @ 2015-03-01  3:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-arch, linux-kernel, linux-mm

On Sat, 2015-02-28 at 16:41 -0800, Linus Torvalds wrote:
> On Sat, Feb 28, 2015 at 3:02 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > Anyway, here's the current patch:
> 
> Ok, I think I like this approach better.
> 
> Your FAULT_FLAG_EXEC handling is wrong, though. It shouldn't check
> VM_WRITE, it should check VM_EXEC. A bit too much copy-paste ;)

Ah yes indeed :) I would have caught that pretty quickly once I tried
powerpc.

> Btw, it's quite possible that we could just do all the PF_PROT
> handling at the x86 level, before even calling the generic fault
> handler. It's not like we even need the vma or the mm semaphore: if
> it's a non-write protection fault, we always SIGSEGV. So why even
> bother getting the locks and looking up the page tables etc?

For non-write possibly, though that means we always end up going
through handle_mm_fault for a non-present PTE before we decide
that it wasn't actually accessible.

Not a huge deal I suppose unless something (ab)uses mprotect + emulation
for things like spying on MMIO accesses (been there) or god knows what
other userspace based paging or GC might be out there. It's asking for
trouble but the check in access_error() doesn't hurt much either.

On the other hand, we are all about simplifying code here, aren't
we ? :-) It's not like the early bail-out on x86 will buy us much and
it's going to be more arch specific code since they'll all have a
different variant of PF_PROT.

One thing that worries me a bit, however, with the PF_PROT handling in
the generic case is the case of archs who support
present-but-not-accessible PTEs in "HW" (for us that basically means no
_PAGE_USER), as these might use such fault for NUMA balancing, or
software maintained "young" bit, and in those case, bailing out on
PF_PROT is wrong. I'll have to figure out if that ever happens when I
dig through more archs.

> Now, that PF_PROT handling isn't exactly performance-critical, but it
> might help to remove the odd x86 special case from the generic code.

I don't think it's that odd on x86. If you look at what other archs do,
there's a bit of everything out there. Just powerpc may or may not
support exec permission for example depending on the CPU generation and
when it doesn't it's basically similar to x86. I yet have to fully
understand what ARM does (looks simple, just need to make sure it's what
I think it is and check various ARM generations), sparc has its
oddities, etc...

None of that is terribly urgent and I'm pretty busy so give me a bit
more time now that we have some kind of direction, to go through a few
more archs and see where it takes us. I'll start gathering cross
compilers and qemu debian images in the next few days.

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2015-03-01  3:57 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1422361485.6648.71.camel@opensuse.org>
     [not found] ` <54C78756.9090605@suse.cz>
     [not found]   ` <alpine.LSU.2.11.1501271347440.30227@nerf60.vanv.qr>
     [not found]     ` <1422364084.6648.82.camel@opensuse.org>
     [not found]       ` <s5h7fw8hvdp.wl-tiwai@suse.de>
     [not found]         ` <CA+55aFyzy_wYHHnr2gDcYr7qcgOKM2557bRdg6RBa=cxrynd+Q@mail.gmail.com>
2015-01-27 20:57           ` [opensuse-factory] Re: [opensuse-kernel] libsigsegv build fail with kernel 3.18.3 Linus Torvalds
     [not found]             ` <CA+55aFxRnj97rpSQvvzLJhpo7C8TQ-F=eB1Ry2n53AV1rN8mwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-28  7:38               ` [opensuse-factory] " Heiko Carstens
2015-01-28  7:38                 ` [opensuse-factory] Re: [opensuse-kernel] " Heiko Carstens
2015-01-29  6:59             ` Max Filippov
2015-01-29 18:16               ` Linus Torvalds
2015-02-02  0:23                 ` Benjamin Herrenschmidt
2015-02-02  1:09                   ` Linus Torvalds
2015-02-22 23:50                     ` Benjamin Herrenschmidt
2015-02-22 23:50                       ` [opensuse-factory] " Benjamin Herrenschmidt
2015-02-28  7:12                     ` Generic page fault (Was: libsigsegv ....) Benjamin Herrenschmidt
2015-02-28  7:14                       ` Benjamin Herrenschmidt
2015-02-28  7:14                         ` Benjamin Herrenschmidt
2015-02-28 10:36                         ` Benjamin Herrenschmidt
2015-02-28 10:36                           ` Benjamin Herrenschmidt
2015-02-28 19:56                       ` Linus Torvalds
2015-02-28 19:56                         ` Linus Torvalds
2015-02-28 19:58                         ` Linus Torvalds
2015-02-28 19:58                           ` Linus Torvalds
2015-02-28 21:14                         ` Benjamin Herrenschmidt
2015-02-28 21:14                           ` Benjamin Herrenschmidt
2015-02-28 21:49                           ` Linus Torvalds
2015-02-28 21:49                             ` Linus Torvalds
2015-02-28 22:49                             ` Benjamin Herrenschmidt
2015-02-28 22:49                               ` Benjamin Herrenschmidt
2015-02-28 22:16                           ` Benjamin Herrenschmidt
2015-02-28 22:16                             ` Benjamin Herrenschmidt
2015-02-28 22:50                             ` Benjamin Herrenschmidt
2015-02-28 22:50                               ` Benjamin Herrenschmidt
2015-02-28 23:02                             ` Benjamin Herrenschmidt
2015-02-28 23:02                               ` Benjamin Herrenschmidt
2015-03-01  0:41                               ` Linus Torvalds
2015-03-01  0:41                                 ` Linus Torvalds
2015-03-01  3:57                                 ` Benjamin Herrenschmidt
2015-03-01  3:57                                   ` Benjamin Herrenschmidt
     [not found]           ` <CA+55aFyzy_wYHHnr2gDcYr7qcgOKM2557bRdg6RBa=cxrynd+Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-27 21:12             ` [opensuse-factory] Re: libsigsegv build fail with kernel 3.18.3 Jan Engelhardt
2015-01-27 21:12               ` [opensuse-factory] Re: [opensuse-kernel] " Jan Engelhardt
2015-01-27 21:32               ` Linus Torvalds
2015-01-27 22:14                 ` Jan Engelhardt
2015-01-27 22:32                   ` Linus Torvalds
2015-01-27 23:13                     ` Jan Engelhardt
2015-01-27 23:53                     ` David Miller
     [not found]                     ` <CA+55aFzguEFfG2REN1soMC+0UJ7GtANfEvMoCNPt0QqmP9LKoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-28  8:48                       ` [opensuse-factory] " Andreas Schwab
2015-01-28  8:48                         ` [opensuse-factory] Re: [opensuse-kernel] " Andreas Schwab

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.