* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-06-24 18:20 Stuart_Hayes
  2005-06-25  2:28 ` Andi Kleen
  0 siblings, 1 reply; 25+ messages in thread
From: Stuart_Hayes @ 2005-06-24 18:20 UTC (permalink / raw)
  To: riel; +Cc: ak, andrea, linux-kernel

Rik Van Riel wrote:
> On Wed, 22 Jun 2005 Stuart_Hayes@Dell.com wrote:
> 
>> My question is this:  when split_large_page() is called, should it
>> make the other 511 PTEs inherit the page attributes from the large
>> page (with the exception of PAGE_PSE, obviously)?
> 
> I suspect it should.

(Copying Andi Kleen & Andrea Arcangeli since they were involved in a
previous discussion of this.  I'm trying to fix the NX handling when
change_page_attr() is called in i386.)

After looking into it further, I agree completely.  I found a thread in
which this was discussed
(http://marc.theaimsgroup.com/?l=linux-kernel&m=109964359124731&w=2),
and discovered that this has been addressed in the X86_64 kernel.

I modified pageattr.c so that small pages would inherit the _PAGE_NX
setting from the large page when split_large_page() is called, and it
works... but un-doing the change_page_attr() is still a problem:

I wrote a test module that calls __get_free_pages() to get a single
page.  The page I got was part of a large page, which was executable.

I then did change_page_attr(page,PAGE_KERNEL_NOCACHE)--with my patch to
pageattr.c, this worked fine.  The large page was split, and the other
511 new PTEs were still executable, though my page was no longer
executable (since PAGE_KERNEL_NOCACHE has _PAGE_NX set).

However, when I used "change_page_attr()" to change the page to
PAGE_KERNEL, it did just that.  But PAGE_KERNEL has the _PAGE_NX bit set
and isn't executable.  And, since PAGE_KERNEL (with _PAGE_NX set) didn't
match the original page's attributes, the 512 PTEs weren't reverted back
into a large page.  (Also, __change_page_attr() did *another* get_page()
on the page containing these 512 PTEs, so now the page_count has gone up
to 3, instead of going back down to 1 (or staying at 2).)
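
(Roughly, the test boils down to this -- a sketch rather than the exact
module source; it assumes the three-argument change_page_attr() prototype
from 2.6.12 i386 and trims error handling:)

#include <linux/module.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <asm/cacheflush.h>	/* change_page_attr(), global_flush_tlb() */

static unsigned long buf;

static int __init nxtest_init(void)
{
	struct page *page;

	/* One page from the direct mapping; here it happens to land
	 * inside an executable large page. */
	buf = __get_free_pages(GFP_KERNEL, 0);
	if (!buf)
		return -ENOMEM;
	page = virt_to_page(buf);

	/* Splits the large page; with the pageattr.c change the other
	 * 511 PTEs keep the large page's attributes (still executable). */
	change_page_attr(page, 1, PAGE_KERNEL_NOCACHE);
	global_flush_tlb();

	/* Meant to undo the change -- but PAGE_KERNEL carries _PAGE_NX,
	 * which doesn't match the original mapping, so the large page is
	 * never put back together. */
	change_page_attr(page, 1, PAGE_KERNEL);
	global_flush_tlb();

	return 0;
}

static void __exit nxtest_exit(void)
{
	free_pages(buf, 0);
}

module_init(nxtest_init);
module_exit(nxtest_exit);
MODULE_LICENSE("GPL");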

Is the function that calls "change_page_attr()" expected to look at the
attributes of the page before calling change_page_attr(), so it knows
how to un-change the attributes when it is finished with the page (since
PAGE_KERNEL isn't always what the page was originally)?
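
(If so, the caller-side pattern would look something like this -- purely
illustrative, since knowing the original prot is exactly the hard part:)

	/* The caller remembers what the mapping really was and restores
	 * that, instead of assuming PAGE_KERNEL.  In my test case the
	 * "original" would have to be PAGE_KERNEL_EXEC, which a random
	 * driver has no obvious way of knowing. */
	pgprot_t orig = PAGE_KERNEL_EXEC;	/* known/remembered by the caller */

	change_page_attr(page, 1, PAGE_KERNEL_NOCACHE);
	global_flush_tlb();
	/* ... use the page uncached ... */
	change_page_attr(page, 1, orig);
	global_flush_tlb();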

Or should "change_page_attr()" ignore the NX bit in the "pgprot_t prot"
parameter that's passed to it, and just inherit NX from the
large/existing page?  Then change_page_attr(page,PAGE_KERNEL) could be
used to undo changes, but change_page_attr() couldn't be used to modify
the NX bit.
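
(The second option would amount to something like this inside
__change_page_attr(), before prot is used -- just a sketch:)

	/* Sketch: drop the NX bit from the caller's request and inherit it
	 * from the existing mapping, so PAGE_KERNEL can be used to undo a
	 * change without clearing executability. */
	pgprot_val(prot) &= ~_PAGE_NX;
	pgprot_val(prot) |= pte_val(*kpte) & _PAGE_NX;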

Thanks
Stuart


* Re: page allocation/attributes question (i386/x86_64 specific)
  2005-06-24 18:20 page allocation/attributes question (i386/x86_64 specific) Stuart_Hayes
@ 2005-06-25  2:28 ` Andi Kleen
  0 siblings, 0 replies; 25+ messages in thread
From: Andi Kleen @ 2005-06-25  2:28 UTC (permalink / raw)
  To: Stuart_Hayes; +Cc: riel, ak, andrea, linux-kernel

On Fri, Jun 24, 2005 at 01:20:04PM -0500, Stuart_Hayes@Dell.com wrote:
> Rik Van Riel wrote:
> > On Wed, 22 Jun 2005 Stuart_Hayes@Dell.com wrote:
> > 
> >> My question is this:  when split_large_page() is called, should it
> >> make the other 511 PTEs inherit the page attributes from the large
> >> page (with the exception of PAGE_PSE, obviously)?
> > 
> > I suspect it should.
> 
> (Copying Andi Kleen & Andrea Arcangeli since they were involved in a
> previous discussion of this.  I'm trying to fix the NX handling when
> change_page_attr() is called in i386.)
> 
> After looking into it further, I agree completely.  I found a thread in
> which this was discussed
> (http://marc.theaimsgroup.com/?l=linux-kernel&m=109964359124731&w=2),
> and discovered that this has been addressed in the X86_64 kernel.

You don't say which kernel you were working against. 2.6.11 has
some fixes in this area (in particular with the reference counts) 

> However, when I used "change_page_attr()" to change the page to
> PAGE_KERNEL, it did just that.  But PAGE_KERNEL has the _PAGE_NX bit set
> and isn't executable.  And, since PAGE_KERNEL (with _PAGE_NX set) didn't
> match the original page's attributes, the 512 PTEs weren't reverted back
> into a large page.  (Also, __change_page_attr() did *another* get_page()
> on the page containing these 512 PTEs, so now the page_count has gone up
> to 3, instead of going back down to 1 (or staying at 2).)

That should be already fixed.
> 
> Is the function that calls "change_page_attr()" expected to look at the
> attributes of the page before calling change_page_attr(), so it knows
> how to un-change the attributes when it is finished with the page (since
> PAGE_KERNEL isn't always what the page was originally)?

No, it's not supposed to do such checks on its own.

> 
> Or should "change_page_attr()" ignore the NX bit in the "pgprot_t prot"
> parameter that's passed to it, and just inherit NX from the
> large/existing page?  Then change_page_attr(page,PAGE_KERNEL) could be
> used to undo changes, but change_page_attr() couldn't be used to modify
> the NX bit.

I don't think that makes sense. It should exactly set the protection
the caller requested and revert when the protection is exactly like
it was before.

-Andi


* Re: page allocation/attributes question (i386/x86_64 specific)
  2005-07-20 15:06 Stuart_Hayes
@ 2005-07-20 20:45 ` Ingo Molnar
  0 siblings, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2005-07-20 20:45 UTC (permalink / raw)
  To: Stuart_Hayes; +Cc: ak, riel, andrea, linux-kernel


* Stuart_Hayes@Dell.com <Stuart_Hayes@Dell.com> wrote:

> Oh, sorry, we're talking about two different patches.  I sent in a 
> different patch yesterday, because Andi Kleen didn't seem very 
> enthusiastic about fixnx2.patch.  Here's the patch that I sent 
> yesterday (attached as file init.c.patch).

ah - ok - this patch indeed should solve it.

	Ingo


* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-07-20 19:24 Stuart_Hayes
  0 siblings, 0 replies; 25+ messages in thread
From: Stuart_Hayes @ 2005-07-20 19:24 UTC (permalink / raw)
  To: mingo; +Cc: ak, riel, andrea, linux-kernel

Hayes, Stuart wrote:
> Ingo Molnar wrote:
>> * Stuart_Hayes@Dell.com <Stuart_Hayes@Dell.com> wrote:
>> 
>>> Ingo Molnar wrote:
>>>> there's one problem with the patch: it breaks things that need the
>>>> low 1MB executable (e.g. APM bios32 calls). It would at a minimum
>>>> be needed to exclude the BIOS area in 0xd0000-0xfffff.
>>>> 
>>>> 	Ingo
>>> 
>>> I wrote it to make everything below 1MB executable, if it isn't RAM
>>> according to the e820 map, which should include the BIOS area.  This
>>> includes 0xd0000-0xfffff on my system.  Do you think I should
>>> explicitly make 0xd0000-0xfffff executable regardless of the e820
>>> map? 
>> 
>> hm ... which portion does this? I'm looking at fixnx2.patch. I
>> definitely saw a APM bootup crash due to this, but that was on a
>> 2.4-ish backport of the patch.
>> 
>> 	Ingo
> 
> Oh, sorry, we're talking about two different patches.  I sent in a
> different patch yesterday, because Andi Kleen didn't seem very
> enthusiastic about fixnx2.patch.  Here's the patch that I sent
> yesterday (attached as file init.c.patch).   
> 
> Thanks
> Stuart

Although... if you like the fixnx2.patch better, I can modify it.  I'm
ok with either approach.

Stuart


* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-07-20 15:06 Stuart_Hayes
  2005-07-20 20:45 ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Stuart_Hayes @ 2005-07-20 15:06 UTC (permalink / raw)
  To: mingo; +Cc: ak, riel, andrea, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1061 bytes --]

Ingo Molnar wrote:
> * Stuart_Hayes@Dell.com <Stuart_Hayes@Dell.com> wrote:
> 
>> Ingo Molnar wrote:
>>> there's one problem with the patch: it breaks things that need the
>>> low 1MB executable (e.g. APM bios32 calls). It would at a minimum be
>>> needed to exclude the BIOS area in 0xd0000-0xfffff.
>>> 
>>> 	Ingo
>> 
>> I wrote it to make everything below 1MB executable, if it isn't RAM
>> according to the e820 map, which should include the BIOS area.  This
>> includes 0xd0000-0xfffff on my system.  Do you think I should
>> explicitly make 0xd0000-0xfffff executable regardless of the e820 map?
> 
> hm ... which portion does this? I'm looking at fixnx2.patch. I
> definitely saw a APM bootup crash due to this, but that was on a
> 2.4-ish backport of the patch.  
> 
> 	Ingo

Oh, sorry, we're talking about two different patches.  I sent in a
different patch yesterday, because Andi Kleen didn't seem very
enthusiastic about fixnx2.patch.  Here's the patch that I sent yesterday
(attached as file init.c.patch).

Thanks
Stuart

[-- Attachment #2: init.c.patch --]
[-- Type: application/octet-stream, Size: 3919 bytes --]

--- 2.6.12-a/arch/i386/mm/init.c	2005-07-19 14:41:14.000000000 -0500
+++ 2.6.12-b/arch/i386/mm/init.c	2005-07-19 14:40:22.000000000 -0500
@@ -128,12 +128,7 @@ static void __init page_table_range_init
 	}
 }
 
-static inline int is_kernel_text(unsigned long addr)
-{
-	if (addr >= PAGE_OFFSET && addr <= (unsigned long)__init_end)
-		return 1;
-	return 0;
-}
+static pgprot_t page_refprot(unsigned int);
 
 /*
  * This maps the physical memory to kernel virtual address space, a total 
@@ -158,25 +153,18 @@ static void __init kernel_physical_mappi
 			continue;
 		for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD && pfn < max_low_pfn; pmd++, pmd_idx++) {
 			unsigned int address = pfn * PAGE_SIZE + PAGE_OFFSET;
+			pgprot_t ref_prot = page_refprot(address);
 
 			/* Map with big pages if possible, otherwise create normal page tables. */
-			if (cpu_has_pse) {
-				unsigned int address2 = (pfn + PTRS_PER_PTE - 1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1;
-
-				if (is_kernel_text(address) || is_kernel_text(address2))
-					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE_EXEC));
-				else
-					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE));
+			if (pgprot_val(ref_prot) & _PAGE_PSE) { 
+				set_pmd(pmd, pfn_pmd(pfn, ref_prot));
 				pfn += PTRS_PER_PTE;
 			} else {
 				pte = one_page_table_init(pmd);
 
-				for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; pte++, pfn++, pte_ofs++) {
-						if (is_kernel_text(address))
-							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
-						else
-							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));
-				}
+				for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; 
+				     address += PAGE_SIZE, pte++, pfn++, pte_ofs++)
+						set_pte(pte, pfn_pte(pfn, page_refprot(address)));
 			}
 		}
 	}
@@ -229,6 +217,56 @@ static inline int page_is_ram(unsigned l
 	return 0;
 }
 
+/*
+ * page_refprot() returns PTE attributes used to set up the page tables
+ */
+
+static inline int is_kernel_text (unsigned int addr, unsigned int mask)
+{
+	addr &= mask;
+	if ((addr >= ((unsigned int)_text & mask)) &&
+	    (addr <= ((unsigned int)_etext)))
+		return 1;
+	return 0;
+}
+
+static inline int is_kernel_inittext (unsigned int addr, unsigned int mask)
+{
+	addr &= mask;
+	if ((addr >= ((unsigned int)_sinittext & mask)) &&
+	    (addr <= ((unsigned int)_einittext)))
+		return 1;
+	return 0;
+}
+
+static pgprot_t page_refprot(unsigned int addr)
+{
+	if (nx_enabled &&
+	    (is_kernel_text(addr, LARGE_PAGE_MASK) ||
+	     (is_kernel_inittext(addr, LARGE_PAGE_MASK)) ||
+	     ((addr & LARGE_PAGE_MASK) == 0))) {
+		/* big page area has executable stuff in it--use small pages */
+		if (is_kernel_text(addr, PAGE_MASK) ||
+		   (is_kernel_inittext(addr, PAGE_MASK)) ||
+		   ((__pa(addr) <= 0xfffff) && !page_is_ram(__pa(addr) >> PAGE_SHIFT)))
+			return PAGE_KERNEL_EXEC;
+		else
+			return PAGE_KERNEL;
+	}
+	/* big pages with no executable stuff in them */
+	return cpu_has_pse ? PAGE_KERNEL_LARGE : PAGE_KERNEL;
+}
+
+static inline void set_nx_in_initmem(void)
+{
+	unsigned int addr;
+
+	printk("_text=%p _etext=%p _sinittext=%p _einittext=%p\n",_text,_etext,_sinittext,_einittext);
+	addr = (unsigned long)(_sinittext) & PAGE_MASK;
+	for (; addr <= (unsigned long)(_einittext); addr += PAGE_SIZE)
+		set_kernel_exec(addr, 0);
+}
+
 #ifdef CONFIG_HIGHMEM
 pte_t *kmap_pte;
 pgprot_t kmap_prot;
@@ -436,7 +474,7 @@ static void __init set_nx(void)
  * Enables/disables executability of a given kernel page and
  * returns the previous setting.
  */
-int __init set_kernel_exec(unsigned long vaddr, int enable)
+int set_kernel_exec(unsigned long vaddr, int enable)
 {
 	pte_t *pte;
 	int ret = 1;
@@ -670,6 +708,9 @@ void free_initmem(void)
 {
 	unsigned long addr;
 
+	if (nx_enabled)
+		set_nx_in_initmem();
+
 	addr = (unsigned long)(&__init_begin);
 	for (; addr < (unsigned long)(&__init_end); addr += PAGE_SIZE) {
 		ClearPageReserved(virt_to_page(addr));


* Re: page allocation/attributes question (i386/x86_64 specific)
  2005-07-20 14:51 Stuart_Hayes
@ 2005-07-20 15:00 ` Ingo Molnar
  0 siblings, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2005-07-20 15:00 UTC (permalink / raw)
  To: Stuart_Hayes; +Cc: ak, riel, andrea, linux-kernel


* Stuart_Hayes@Dell.com <Stuart_Hayes@Dell.com> wrote:

> Ingo Molnar wrote:
> > there's one problem with the patch: it breaks things that need the
> > low 1MB executable (e.g. APM bios32 calls). It would at a minimum be
> > needed to exclude the BIOS area in 0xd0000-0xfffff.  
> > 
> > 	Ingo
> 
> I wrote it to make everything below 1MB executable, if it isn't RAM 
> according to the e820 map, which should include the BIOS area.  This 
> includes 0xd0000-0xfffff on my system.  Do you think I should explicitly 
> make 0xd0000-0xfffff executable regardless of the e820 map?

hm ... which portion does this? I'm looking at fixnx2.patch. I 
definitely saw a APM bootup crash due to this, but that was on a 2.4-ish 
backport of the patch.

	Ingo


* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-07-20 14:51 Stuart_Hayes
  2005-07-20 15:00 ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Stuart_Hayes @ 2005-07-20 14:51 UTC (permalink / raw)
  To: mingo; +Cc: ak, riel, andrea, linux-kernel

Ingo Molnar wrote:
> there's one problem with the patch: it breaks things that need the
> low 1MB executable (e.g. APM bios32 calls). It would at a minimum be
> needed to exclude the BIOS area in 0xd0000-0xfffff.  
> 
> 	Ingo

I wrote it to make everything below 1MB executable, if it isn't RAM
according to the e820 map, which should include the BIOS area.  This
includes 0xd0000-0xfffff on my system.  Do you think I should explicitly
make 0xd0000-0xfffff executable regardless of the e820 map?

Thanks
Stuart


* Re: page allocation/attributes question (i386/x86_64 specific)
  2005-07-19 21:20 Stuart Hayes
@ 2005-07-19 22:04 ` Ingo Molnar
  0 siblings, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2005-07-19 22:04 UTC (permalink / raw)
  To: Stuart Hayes; +Cc: ak, riel, andrea, linux-kernel


there's one problem with the patch: it breaks things that need the low 
1MB executable (e.g. APM bios32 calls). It would at a minimum be needed 
to exclude the BIOS area in 0xd0000-0xfffff.

	Ingo


* Re: page allocation/attributes question (i386/x86_64 specific)
@ 2005-07-19 21:20 Stuart Hayes
  2005-07-19 22:04 ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Stuart Hayes @ 2005-07-19 21:20 UTC (permalink / raw)
  To: ak; +Cc: ak, riel, andrea, linux-kernel, mingo, stuart_hayes


Andi Kleen wrote:
> I personally wouldn't like doing this NX cleanup very late like you did but instead directly after the early NX setup.

I've thought about it more, and come up with another patch.  All it does is
sets up the PTEs correctly from the beginning, breaking up large pages if
necessary so that only pages that are actually executable kernel code will be
executable.

With this patch, no changes are required for change_page_attr() to work.  The
disadvantages to this patch are (1) it still relies on the fact that
change_page_attr() won't revert small pages to a large page if the page
containing the PTEs is reserved, and (2) if any init code is in a large page
that contains no non-init code, those PTEs won't be reverted to a large page
when the init memory is freed (though they will be changed to non-executable).

Once this patch is in place, it wouldn't take much to modify change_page_attr()
to use the new page_refprot() function to determine the "normal" setting for
pages.  That could eliminate the need for it to ignore reserved pages, but I
wanted to keep this patch simple.
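
(For what it's worth, that follow-up would look roughly like this inside
__change_page_attr() -- a sketch only: it assumes page_refprot() is made
visible to pageattr.c, that split_large_page() has grown the ref_prot
argument from the earlier patch, and it omits the reference-count
handling:)

	pgprot_t ref_prot = page_refprot(address);

	if (pgprot_val(prot) != pgprot_val(ref_prot)) {
		if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
			set_pte_atomic(kpte, mk_pte(page, prot));
		} else {
			struct page *split =
				split_large_page(address, prot, ref_prot);
			if (!split)
				return -ENOMEM;
			/* keep the "normal" prot on the other 511 entries */
			set_pmd_pte(kpte, address, mk_pte(split, ref_prot));
		}
	} else if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
		/* reverting: the normal prot for this page, not PAGE_KERNEL */
		set_pte_atomic(kpte, mk_pte(page, ref_prot));
	}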

Anyway, comments appreciated!

Stuart
----


--- 2.6.12-a/arch/i386/mm/init.c	2005-07-19 14:41:14.000000000 -0500
+++ 2.6.12-b/arch/i386/mm/init.c	2005-07-19 14:40:22.000000000 -0500
@@ -128,12 +128,7 @@ static void __init page_table_range_init
 	}
 }
 
-static inline int is_kernel_text(unsigned long addr)
-{
-	if (addr >= PAGE_OFFSET && addr <= (unsigned long)__init_end)
-		return 1;
-	return 0;
-}
+static pgprot_t page_refprot(unsigned int);
 
 /*
  * This maps the physical memory to kernel virtual address space, a total 
@@ -158,25 +153,18 @@ static void __init kernel_physical_mappi
 			continue;
 		for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD && pfn < max_low_pfn; pmd++, pmd_idx++) {
 			unsigned int address = pfn * PAGE_SIZE + PAGE_OFFSET;
+			pgprot_t ref_prot = page_refprot(address);
 
 			/* Map with big pages if possible, otherwise create normal page tables. */
-			if (cpu_has_pse) {
-				unsigned int address2 = (pfn + PTRS_PER_PTE - 1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1;
-
-				if (is_kernel_text(address) || is_kernel_text(address2))
-					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE_EXEC));
-				else
-					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE));
+			if (pgprot_val(ref_prot) & _PAGE_PSE) { 
+				set_pmd(pmd, pfn_pmd(pfn, ref_prot));
 				pfn += PTRS_PER_PTE;
 			} else {
 				pte = one_page_table_init(pmd);
 
-				for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; pte++, pfn++, pte_ofs++) {
-						if (is_kernel_text(address))
-							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
-						else
-							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));
-				}
+				for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; 
+				     address += PAGE_SIZE, pte++, pfn++, pte_ofs++)
+						set_pte(pte, pfn_pte(pfn, page_refprot(address)));
 			}
 		}
 	}
@@ -229,6 +217,56 @@ static inline int page_is_ram(unsigned l
 	return 0;
 }
 
+/*
+ * page_refprot() returns PTE attributes used to set up the page tables
+ */
+
+static inline int is_kernel_text (unsigned int addr, unsigned int mask)
+{
+	addr &= mask;
+	if ((addr >= ((unsigned int)_text & mask)) &&
+	    (addr <= ((unsigned int)_etext)))
+		return 1;
+	return 0;
+}
+
+static inline int is_kernel_inittext (unsigned int addr, unsigned int mask)
+{
+	addr &= mask;
+	if ((addr >= ((unsigned int)_sinittext & mask)) &&
+	    (addr <= ((unsigned int)_einittext)))
+		return 1;
+	return 0;
+}
+
+static pgprot_t page_refprot(unsigned int addr)
+{
+	if (nx_enabled &&
+	    (is_kernel_text(addr, LARGE_PAGE_MASK) ||
+	     (is_kernel_inittext(addr, LARGE_PAGE_MASK)) ||
+	     ((addr & LARGE_PAGE_MASK) == 0))) {
+		/* big page area has executable stuff in it--use small pages */
+		if (is_kernel_text(addr, PAGE_MASK) ||
+		   (is_kernel_inittext(addr, PAGE_MASK)) ||
+		   ((__pa(addr) <= 0xfffff) && !page_is_ram(__pa(addr) >> PAGE_SHIFT)))
+			return PAGE_KERNEL_EXEC;
+		else
+			return PAGE_KERNEL;
+	}
+	/* big pages with no executable stuff in them */
+	return cpu_has_pse ? PAGE_KERNEL_LARGE : PAGE_KERNEL;
+}
+
+static inline void set_nx_in_initmem(void)
+{
+	unsigned int addr;
+
+	printk("_text=%p _etext=%p _sinittext=%p _einittext=%p\n",_text,_etext,_sinittext,_einittext);
+	addr = (unsigned long)(_sinittext) & PAGE_MASK;
+	for (; addr <= (unsigned long)(_einittext); addr += PAGE_SIZE)
+		set_kernel_exec(addr, 0);
+}
+
 #ifdef CONFIG_HIGHMEM
 pte_t *kmap_pte;
 pgprot_t kmap_prot;
@@ -436,7 +474,7 @@ static void __init set_nx(void)
  * Enables/disables executability of a given kernel page and
  * returns the previous setting.
  */
-int __init set_kernel_exec(unsigned long vaddr, int enable)
+int set_kernel_exec(unsigned long vaddr, int enable)
 {
 	pte_t *pte;
 	int ret = 1;
@@ -670,6 +708,9 @@ void free_initmem(void)
 {
 	unsigned long addr;
 
+	if (nx_enabled)
+		set_nx_in_initmem();
+
 	addr = (unsigned long)(&__init_begin);
 	for (; addr < (unsigned long)(&__init_end); addr += PAGE_SIZE) {
 		ClearPageReserved(virt_to_page(addr));

----- End forwarded message -----


* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-07-07 17:21 Stuart_Hayes
  0 siblings, 0 replies; 25+ messages in thread
From: Stuart_Hayes @ 2005-07-07 17:21 UTC (permalink / raw)
  To: ak; +Cc: riel, andrea, linux-kernel, mingo

>> 
>> Andi--
>> 
>> I made another pass at this.  This does roughly the same thing, but
>> it doesn't create the new "change_page_attr_perm()" functions.  With
>> this patch, the change to init.c (cleanup_nx_in_kerneltext()) is
>> optional. 
>> I changed __change_page_attr() so that, if the page to be changed is
>> part of a large executable page, it splits the page up *keeping the
>> executability of the extra 511 pages*, and then marks the new PTE
>> page as reserved so that it won't be reverted.
>> 
>> So, basically, without the changes to init.c, the NX bits for data in
>> the first two big pages won't get fixed until someone calls
>> change_page_attr() on them.  If NX is disabled, these patches have
>> no functional effect at all. 
>> 
>> How does this look?
> 
> If anything you would need to ask Ingo who did the i386 NX code
> (cc'ed) 
> 
> I personally wouldn't like doing this NX cleanup very late like you
> did but instead directly after the early NX setup. 
> 

I started to do this earlier, in kernel_physical_mapping_init(), where
the PTEs are set up in the first place.  However, at that point, the
__init code still needs to be executable, and the first 1MB of memory
from PAGE_OFFSET to _text also (apparently) needs to be executable.

But I could change it to set it up right from the start, then change
those ranges back to NX after they no longer need to be executable.
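
(Something like this at the end of free_initmem(), for instance -- a
sketch; set_kernel_exec() would have to lose its __init annotation to
still be callable from there:)

	/* Once the init memory is freed, the __init text no longer needs
	 * to be executable -- clear exec on that range again. */
	unsigned long addr = (unsigned long)_sinittext & PAGE_MASK;

	for (; addr <= (unsigned long)_einittext; addr += PAGE_SIZE)
		set_kernel_exec(addr, 0);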

> 
>> Thanks!
>> Stuart
>> 
>> -----
>> 
>> diff -purN linux-2.6.12grep/arch/i386/mm/init.c
>> linux-2.6.12/arch/i386/mm/init.c
>> --- linux-2.6.12grep/arch/i386/mm/init.c	2005-07-01
>> 15:09:27.000000000 -0500 +++
>> linux-2.6.12/arch/i386/mm/init.c	2005-07-05 14:32:57.000000000
-0500
>>  @@ -666,6 +666,28 @@ static int noinline do_test_wp_bit(void) 
>> return flag; } 
>> 
>> +/*
>> + * In kernel_physical_mapping_init(), any big pages that contained
>> kernel text area were + * set up as big executable pages.  This
>> function should be called when the initmem + * is freed, to
>> correctly set up the executable & non-executable pages in this area.
>> + */ +static void cleanup_nx_in_kerneltext(void)
>> +{
>> +	unsigned long from, to;
>> +
>> +	if (!nx_enabled) return;
>> +
>> +	from = PAGE_OFFSET;
>> +	to = (unsigned long)_text & PAGE_MASK;
>> +	for (; from<to; from += PAGE_SIZE)
>> +		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); +
>> +	from = ((unsigned long)_etext + PAGE_SIZE - 1) & PAGE_MASK;
>> +	to = ((unsigned long)__init_end + LARGE_PAGE_SIZE) &
>> LARGE_PAGE_MASK;
>> +	for (; from<to; from += PAGE_SIZE)
>> +		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); +}
>> +
>>  void free_initmem(void)
>>  {
>>  	unsigned long addr;
>> @@ -679,6 +701,8 @@ void free_initmem(void)
>>  		totalram_pages++;
>>  	}
>>  	printk (KERN_INFO "Freeing unused kernel memory: %dk freed\n",
>> (__init_end - __init_begin) >> 10);
>> +
>> +	cleanup_nx_in_kerneltext();
>>  }
>> 
>>  #ifdef CONFIG_BLK_DEV_INITRD
>> diff -purN linux-2.6.12grep/arch/i386/mm/pageattr.c
>> linux-2.6.12/arch/i386/mm/pageattr.c
>> --- linux-2.6.12grep/arch/i386/mm/pageattr.c	2005-07-01
>> 15:09:08.000000000 -0500 +++
>> linux-2.6.12/arch/i386/mm/pageattr.c	2005-07-05 14:44:53.000000000
>>          -0500 @@ -35,7 +35,8 @@ pte_t *lookup_address(unsigned long
>>  addr return pte_offset_kernel(pmd, address); }
>> 
>> -static struct page *split_large_page(unsigned long address,
>> pgprot_t prot) +static struct page *split_large_page(unsigned long
>> address, pgprot_t prot, +					pgprot_t
ref_prot)
>>  {
>>  	int i;
>>  	unsigned long addr;
>> @@ -53,7 +54,7 @@ static struct page *split_large_page(uns
>>  	pbase = (pte_t *)page_address(base);
>>  	for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
>>  		pbase[i] = pfn_pte(addr >> PAGE_SHIFT,
>> -				   addr == address ? prot :
>> PAGE_KERNEL);
>> +				   addr == address ? prot : ref_prot);
>>  	}
>>  	return base;
>>  }
>> @@ -118,11 +119,30 @@ __change_page_attr(struct page *page, pg
if
>>  		(!kpte) return -EINVAL;
>>  	kpte_page = virt_to_page(kpte);
>> +
>> +	/*
>> +	 * If this page is part of a large page that's executable (and
>> NX is
>> +	 * enabled), then split page up and set new PTE page as reserved
>> so
>> +	 * we won't revert this back into a large page.  This should
>> only
>> +	 * happen in large pages that also contain kernel executable
>> code,
>> +	 * and shouldn't happen at all if init.c correctly sets up NX
>> regions. +	 */
>> +	if (nx_enabled &&
>> +	    !(pte_val(*kpte) & _PAGE_NX) &&
>> +	    (pte_val(*kpte) & _PAGE_PSE)) {
>> +		struct page *split = split_large_page(address, prot,
>> PAGE_KERNEL_EXEC); +		if (!split)
>> +			return -ENOMEM;
>> +		set_pmd_pte(kpte,address,mk_pte(split,
>> PAGE_KERNEL_EXEC));
>> +		SetPageReserved(split);
> 
> Why setting reserved? I don't think cpa checks that anywhere.
> In fact Nick has been working on getting rid of Reserved completely.
> 
> 
> Anyways, isn't the code from x86-64 for this enough? It should
> work here too. I don't think such a ugly special case is needed.
> 
> 
> -Andi

I set it reserved so that the large page that was just split wouldn't
get reverted back into a large page ever again (the __change_page_attr()
code won't do that if the PTE page is reserved).  I guess this wouldn't
really be necessary, though, if I make sure that the page count works
right in this case... with this patch the counting won't work right
since the cleanup_nx_in_kerneltext() would be calling
change_page_attr(page,PAGE_KERNEL) on pages that weren't previously
changed to something else.

I still don't think the code from x86_64 would work, because, in x86_64,
kernel_text is mapped to a different virtual address from where data is
accessed.  In x86_64, change_page_attr() changes the PTE attributes as
requested where data is accessed, and, if the same physical address is
in kernel text, it *also* changes the attributes in kernel text.  For
the kernel text addresses, the NX bit in the requested prot is ignored,
and NX is never set.

Because i386 doesn't have a separate virtual address space for the
kernel text and data areas, we couldn't do exactly that.  But we could
do something *similar* to that, where we use either PAGE_KERNEL or
PAGE_KERNEL_EXEC as the "ref_prot", depending on whether the address in
question has any executable kernel code residing in the same large page
as it.  Unless we simply ignored the NX bit in the requested prot,
though, the page still wouldn't get reverted back into a large
(executable) page, so it would have the same effect as this patch--i.e.,
large executable pages would get split up the first time anyone called
change_page_attr() on them, and they wouldn't get recombined into a
large page.  It would move the ugly special case above into an address
check in "change_page_attr()".
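
(Concretely, the ref_prot choice could be a small helper along these
lines -- a sketch, name and all hypothetical:)

/* Pick the "normal" protection for the 4K entries created when the
 * large page containing 'address' is split: keep entries executable
 * only if kernel text shares that large page. */
static pgprot_t i386_ref_prot(unsigned long address)
{
	unsigned long lp_start = address & LARGE_PAGE_MASK;
	unsigned long lp_end = lp_start + LARGE_PAGE_SIZE - 1;

	if (lp_end >= (unsigned long)_text && lp_start <= (unsigned long)_etext)
		return PAGE_KERNEL_EXEC;
	return PAGE_KERNEL;
}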

Thanks again for the feedback!  I'll think about it some more and wait
for comments from Ingo, then try again.


> 
>> +		return 0;
>> +	}
>> +
>>  	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
>>  		if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
>>  			set_pte_atomic(kpte, mk_pte(page, prot));
>>  		} else {
>> -			struct page *split = split_large_page(address,
>> prot);
>> +			struct page *split = split_large_page(address,
>> prot, PAGE_KERNEL);
>>  			if (!split)
>>  				return -ENOMEM;
>>  			set_pmd_pte(kpte,address,mk_pte(split,
>> PAGE_KERNEL));


* Re: page allocation/attributes question (i386/x86_64 specific)
  2005-07-05 20:02 Stuart_Hayes
  2005-07-05 21:27 ` randy_dunlap
@ 2005-07-07 14:33 ` Andi Kleen
  1 sibling, 0 replies; 25+ messages in thread
From: Andi Kleen @ 2005-07-07 14:33 UTC (permalink / raw)
  To: Stuart_Hayes; +Cc: ak, riel, andrea, linux-kernel, mingo

On Tue, Jul 05, 2005 at 03:02:26PM -0500, Stuart_Hayes@Dell.com wrote:
[Sorry for  the late answer]

> Hayes, Stuart wrote:
> >> So, if I understand correctly what's going on in x86_64, your fix
> >> wouldn't be applicable to i386.  In x86_64, every large page has a
> >> correct "ref_prot" that is the normal setting for that page... but in
> >> i386, the kernel text area does not--it should ideally be split into
> >> small pages all the time if there are both kernel code & free pages
> >> residing in the same 2M area. 
> >> 
> >> Stuart
> > 
> > (This isn't a submission--I'm just posting this for comments.)
> > 
> > Right now, any large page that touches anywhere from PAGE_OFFSET to
> > __init_end is initially set up as a large, executable page... but
> > some of this area contains data & free pages.  The patch below adds a
> > "cleanup_nx_in_kerneltext()" function, called at the end of
> > free_initmem(), which changes these pages--except for the range from
> > "_text" to "_etext"--to PAGE_KERNEL (i.e., non-executable).     
> > 
> > This does result in two large pages being split up into small PTEs
> > permanently, but all the non-code regions will be non-executable, and
> > change_page_attr() will work correctly.  
> > 
> > What do you think of this?  I have tested this on 2.6.12.
> > 
> > (I've attached the patch as a file, too, since my mail server can't
> > be convinced to not wrap text.) 
> > 
> > Stuart
> > 
> 
> Andi--
> 
> I made another pass at this.  This does roughly the same thing, but it
> doesn't create the new "change_page_attr_perm()" functions.  With this
> patch, the change to init.c (cleanup_nx_in_kerneltext()) is optional.  I
> changed __change_page_attr() so that, if the page to be changed is part
> of a large executable page, it splits the page up *keeping the
> executability of the extra 511 pages*, and then marks the new PTE page
> as reserved so that it won't be reverted.
> 
> So, basically, without the changes to init.c, the NX bits for data in
> the first two big pages won't get fixed until someone calls
> change_page_attr() on them.  If NX is disabled, these patches have no
> functional effect at all.
> 
> How does this look?

If anything you would need to ask Ingo who did the i386 NX code (cc'ed) 

I personally wouldn't like doing this NX cleanup very late like you did 
but instead directly after the early NX setup.



> Thanks!
> Stuart
> 
> -----
> 
> diff -purN linux-2.6.12grep/arch/i386/mm/init.c
> linux-2.6.12/arch/i386/mm/init.c
> --- linux-2.6.12grep/arch/i386/mm/init.c	2005-07-01
> 15:09:27.000000000 -0500
> +++ linux-2.6.12/arch/i386/mm/init.c	2005-07-05 14:32:57.000000000
> -0500
> @@ -666,6 +666,28 @@ static int noinline do_test_wp_bit(void)
>  	return flag;
>  }
>  
> +/*
> + * In kernel_physical_mapping_init(), any big pages that contained
> kernel text area were
> + * set up as big executable pages.  This function should be called when
> the initmem
> + * is freed, to correctly set up the executable & non-executable pages
> in this area.
> + */
> +static void cleanup_nx_in_kerneltext(void)
> +{
> +	unsigned long from, to;
> +
> +	if (!nx_enabled) return;
> +
> +	from = PAGE_OFFSET;
> +	to = (unsigned long)_text & PAGE_MASK;
> +	for (; from<to; from += PAGE_SIZE)
> +		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); 
> +	
> +	from = ((unsigned long)_etext + PAGE_SIZE - 1) & PAGE_MASK;
> +	to = ((unsigned long)__init_end + LARGE_PAGE_SIZE) &
> LARGE_PAGE_MASK;
> +	for (; from<to; from += PAGE_SIZE)
> +		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); 
> +}
> +
>  void free_initmem(void)
>  {
>  	unsigned long addr;
> @@ -679,6 +701,8 @@ void free_initmem(void)
>  		totalram_pages++;
>  	}
>  	printk (KERN_INFO "Freeing unused kernel memory: %dk freed\n",
> (__init_end - __init_begin) >> 10);
> +
> +	cleanup_nx_in_kerneltext();
>  }
>  
>  #ifdef CONFIG_BLK_DEV_INITRD
> diff -purN linux-2.6.12grep/arch/i386/mm/pageattr.c
> linux-2.6.12/arch/i386/mm/pageattr.c
> --- linux-2.6.12grep/arch/i386/mm/pageattr.c	2005-07-01
> 15:09:08.000000000 -0500
> +++ linux-2.6.12/arch/i386/mm/pageattr.c	2005-07-05
> 14:44:53.000000000 -0500
> @@ -35,7 +35,8 @@ pte_t *lookup_address(unsigned long addr
>          return pte_offset_kernel(pmd, address);
>  } 
>  
> -static struct page *split_large_page(unsigned long address, pgprot_t
> prot)
> +static struct page *split_large_page(unsigned long address, pgprot_t
> prot, 
> +					pgprot_t ref_prot)
>  { 
>  	int i; 
>  	unsigned long addr;
> @@ -53,7 +54,7 @@ static struct page *split_large_page(uns
>  	pbase = (pte_t *)page_address(base);
>  	for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
>  		pbase[i] = pfn_pte(addr >> PAGE_SHIFT, 
> -				   addr == address ? prot :
> PAGE_KERNEL);
> +				   addr == address ? prot : ref_prot);
>  	}
>  	return base;
>  } 
> @@ -118,11 +119,30 @@ __change_page_attr(struct page *page, pg
>  	if (!kpte)
>  		return -EINVAL;
>  	kpte_page = virt_to_page(kpte);
> +
> +	/*
> +	 * If this page is part of a large page that's executable (and
> NX is
> +	 * enabled), then split page up and set new PTE page as reserved
> so
> +	 * we won't revert this back into a large page.  This should
> only
> +	 * happen in large pages that also contain kernel executable
> code,
> +	 * and shouldn't happen at all if init.c correctly sets up NX
> regions.
> +	 */
> +	if (nx_enabled && 
> +	    !(pte_val(*kpte) & _PAGE_NX) &&
> +	    (pte_val(*kpte) & _PAGE_PSE)) {
> +		struct page *split = split_large_page(address, prot,
> PAGE_KERNEL_EXEC); 
> +		if (!split)
> +			return -ENOMEM;
> +		set_pmd_pte(kpte,address,mk_pte(split,
> PAGE_KERNEL_EXEC));
> +		SetPageReserved(split);

Why setting reserved? I don't think cpa checks that anywhere.
In fact Nick has been working on getting rid of Reserved completely.


Anyways, isn't the code from x86-64 for this enough? It should 
work here too. I don't think such a ugly special case is needed.


-Andi

> +		return 0;
> +	}
> +
>  	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) { 
>  		if ((pte_val(*kpte) & _PAGE_PSE) == 0) { 
>  			set_pte_atomic(kpte, mk_pte(page, prot)); 
>  		} else {
> -			struct page *split = split_large_page(address,
> prot); 
> +			struct page *split = split_large_page(address,
> prot, PAGE_KERNEL); 
>  			if (!split)
>  				return -ENOMEM;
>  			set_pmd_pte(kpte,address,mk_pte(split,
> PAGE_KERNEL));




* Re: page allocation/attributes question (i386/x86_64 specific)
  2005-07-05 21:35 Stuart_Hayes
@ 2005-07-05 22:00 ` randy_dunlap
  0 siblings, 0 replies; 25+ messages in thread
From: randy_dunlap @ 2005-07-05 22:00 UTC (permalink / raw)
  To: Stuart_Hayes; +Cc: ak, riel, andrea, linux-kernel

On Tue, 5 Jul 2005 16:35:51 -0500 Stuart_Hayes@Dell.com wrote:

| randy_dunlap wrote:
| > On Tue, 5 Jul 2005 15:02:26 -0500 Stuart_Hayes@Dell.com wrote:
| > 
| >> Andi--
| >> 
| >> I made another pass at this.  This does roughly the same thing, but
| >> it doesn't create the new "change_page_attr_perm()" functions.  With
| >> this patch, the change to init.c (cleanup_nx_in_kerneltext()) is
| >> optional. I changed __change_page_attr() so that, if the page to be
| >> changed is part of a large executable page, it splits the page up
| >> *keeping the executability of the extra 511 pages*, and then marks
| >> the new PTE page as reserved so that it won't be reverted.
| >> 
| >> So, basically, without the changes to init.c, the NX bits for data in
| >> the first two big pages won't get fixed until someone calls
| >> change_page_attr() on them.  If NX is disabled, these patches have
| >> no functional effect at all. 
| >> 
| >> How does this look?
| > 
| > Look?  It has lots of bad line breaks and other style issues.
| > 
| > But I'll let Andi comment on the technical issues.
| > 
| >> -----
| >> 
| >> diff -purN linux-2.6.12grep/arch/i386/mm/init.c
| >> linux-2.6.12/arch/i386/mm/init.c
| >> --- linux-2.6.12grep/arch/i386/mm/init.c	2005-07-01
| >> 15:09:27.000000000 -0500 +++
| >> linux-2.6.12/arch/i386/mm/init.c	2005-07-05 14:32:57.000000000
| -0500
| >>  @@ -666,6 +666,28 @@ static int noinline do_test_wp_bit(void) 
| >> return flag; } 
| >> 
| >> +/*
| >> + * In kernel_physical_mapping_init(), any big pages that contained
| >> kernel text area were + * set up as big executable pages.  This
| >> function should be called + when the initmem
| >> + * is freed, to correctly set up the executable & non-executable +
| >> pages in this area.
| >> + */
| >> +static void cleanup_nx_in_kerneltext(void) {
| >> +	unsigned long from, to;
| >> +
| >> +	if (!nx_enabled) return;
| > 
| > return; on separate line.
| > 
| >> +	from = PAGE_OFFSET;
| >> +	to = (unsigned long)_text & PAGE_MASK;
| >> +	for (; from<to; from += PAGE_SIZE)
| > 	       from < to
| > 
| > (i.e., use the spacebar)
| > 
| >> +		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); +
| >> +	from = ((unsigned long)_etext + PAGE_SIZE - 1) & PAGE_MASK;
| >> +	to = ((unsigned long)__init_end + LARGE_PAGE_SIZE) &
| >> LARGE_PAGE_MASK; +	for (; from<to; from += PAGE_SIZE)
| > 
| > add spaces:    from < to
| > 
| >> +		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); }
| +
| >>  void free_initmem(void)
| >>  {
| >>  	unsigned long addr;
| > 
| > 
| > ---
| > ~Randy
| 
| Thanks for the feedback.  I know it looks like crap with the line
| wrapping, but I have no way of sending non-automatically-line-broken
| email right now, which is why I attached the file in addition to the
| inline text.  I apologize for that.

My apologies also, I completely missed seeing the attachment.

| I'll correct those cosmetic issues, thanks.

---
~Randy


* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-07-05 21:35 Stuart_Hayes
  2005-07-05 22:00 ` randy_dunlap
  0 siblings, 1 reply; 25+ messages in thread
From: Stuart_Hayes @ 2005-07-05 21:35 UTC (permalink / raw)
  To: rdunlap; +Cc: ak, riel, andrea, linux-kernel

randy_dunlap wrote:
> On Tue, 5 Jul 2005 15:02:26 -0500 Stuart_Hayes@Dell.com wrote:
> 
>> Hayes, Stuart wrote:
>>>> So, if I understand correctly what's going on in x86_64, your fix
>>>> wouldn't be applicable to i386.  In x86_64, every large page has a
>>>> correct "ref_prot" that is the normal setting for that page... but
>>>> in i386, the kernel text area does not--it should ideally be split
>>>> into small pages all the time if there are both kernel code & free
>>>> pages residing in the same 2M area.
>>>> 
>>>> Stuart
>>> 
>>> (This isn't a submission--I'm just posting this for comments.)
>>> 
>>> Right now, any large page that touches anywhere from PAGE_OFFSET to
>>> __init_end is initially set up as a large, executable page... but
>>> some of this area contains data & free pages.  The patch below adds
>>> a "cleanup_nx_in_kerneltext()" function, called at the end of
>>> free_initmem(), which changes these pages--except for the range from
>>> "_text" to "_etext"--to PAGE_KERNEL (i.e., non-executable).
>>> 
>>> This does result in two large pages being split up into small PTEs
>>> permanently, but all the non-code regions will be non-executable,
>>> and change_page_attr() will work correctly.
>>> 
>>> What do you think of this?  I have tested this on 2.6.12.
>>> 
>>> (I've attached the patch as a file, too, since my mail server can't
>>> be convinced to not wrap text.)
>>> 
>>> Stuart
>>> 
>> 
>> Andi--
>> 
>> I made another pass at this.  This does roughly the same thing, but
>> it doesn't create the new "change_page_attr_perm()" functions.  With
>> this patch, the change to init.c (cleanup_nx_in_kerneltext()) is
>> optional. I changed __change_page_attr() so that, if the page to be
>> changed is part of a large executable page, it splits the page up
>> *keeping the executability of the extra 511 pages*, and then marks
>> the new PTE page as reserved so that it won't be reverted.
>> 
>> So, basically, without the changes to init.c, the NX bits for data in
>> the first two big pages won't get fixed until someone calls
>> change_page_attr() on them.  If NX is disabled, these patches have
>> no functional effect at all. 
>> 
>> How does this look?
> 
> Look?  It has lots of bad line breaks and other style issues.
> 
> But I'll let Andi comment on the technical issues.
> 
>> -----
>> 
>> diff -purN linux-2.6.12grep/arch/i386/mm/init.c
>> linux-2.6.12/arch/i386/mm/init.c
>> --- linux-2.6.12grep/arch/i386/mm/init.c	2005-07-01
>> 15:09:27.000000000 -0500 +++
>> linux-2.6.12/arch/i386/mm/init.c	2005-07-05 14:32:57.000000000
-0500
>>  @@ -666,6 +666,28 @@ static int noinline do_test_wp_bit(void) 
>> return flag; } 
>> 
>> +/*
>> + * In kernel_physical_mapping_init(), any big pages that contained
>> kernel text area were + * set up as big executable pages.  This
>> function should be called + when the initmem
>> + * is freed, to correctly set up the executable & non-executable +
>> pages in this area.
>> + */
>> +static void cleanup_nx_in_kerneltext(void) {
>> +	unsigned long from, to;
>> +
>> +	if (!nx_enabled) return;
> 
> return; on separate line.
> 
>> +	from = PAGE_OFFSET;
>> +	to = (unsigned long)_text & PAGE_MASK;
>> +	for (; from<to; from += PAGE_SIZE)
> 	       from < to
> 
> (i.e., use the spacebar)
> 
>> +		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); +
>> +	from = ((unsigned long)_etext + PAGE_SIZE - 1) & PAGE_MASK;
>> +	to = ((unsigned long)__init_end + LARGE_PAGE_SIZE) &
>> LARGE_PAGE_MASK; +	for (; from<to; from += PAGE_SIZE)
> 
> add spaces:    from < to
> 
>> +		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); }
+
>>  void free_initmem(void)
>>  {
>>  	unsigned long addr;
> 
> 
> ---
> ~Randy

Thanks for the feedback.  I know it looks like crap with the line
wrapping, but I have no way of sending non-automatically-line-broken
email right now, which is why I attached the file in addition to the
inline text.  I apologize for that.

I'll correct those cosmetic issues, thanks.

Stuart


* Re: page allocation/attributes question (i386/x86_64 specific)
  2005-07-05 20:02 Stuart_Hayes
@ 2005-07-05 21:27 ` randy_dunlap
  2005-07-07 14:33 ` Andi Kleen
  1 sibling, 0 replies; 25+ messages in thread
From: randy_dunlap @ 2005-07-05 21:27 UTC (permalink / raw)
  To: Stuart_Hayes; +Cc: ak, riel, andrea, linux-kernel

On Tue, 5 Jul 2005 15:02:26 -0500 Stuart_Hayes@Dell.com wrote:

| Hayes, Stuart wrote:
| >> So, if I understand correctly what's going on in x86_64, your fix
| >> wouldn't be applicable to i386.  In x86_64, every large page has a
| >> correct "ref_prot" that is the normal setting for that page... but in
| >> i386, the kernel text area does not--it should ideally be split into
| >> small pages all the time if there are both kernel code & free pages
| >> residing in the same 2M area. 
| >> 
| >> Stuart
| > 
| > (This isn't a submission--I'm just posting this for comments.)
| > 
| > Right now, any large page that touches anywhere from PAGE_OFFSET to
| > __init_end is initially set up as a large, executable page... but
| > some of this area contains data & free pages.  The patch below adds a
| > "cleanup_nx_in_kerneltext()" function, called at the end of
| > free_initmem(), which changes these pages--except for the range from
| > "_text" to "_etext"--to PAGE_KERNEL (i.e., non-executable).     
| > 
| > This does result in two large pages being split up into small PTEs
| > permanently, but all the non-code regions will be non-executable, and
| > change_page_attr() will work correctly.  
| > 
| > What do you think of this?  I have tested this on 2.6.12.
| > 
| > (I've attached the patch as a file, too, since my mail server can't
| > be convinced to not wrap text.) 
| > 
| > Stuart
| > 
| 
| Andi--
| 
| I made another pass at this.  This does roughly the same thing, but it
| doesn't create the new "change_page_attr_perm()" functions.  With this
| patch, the change to init.c (cleanup_nx_in_kerneltext()) is optional.  I
| changed __change_page_attr() so that, if the page to be changed is part
| of a large executable page, it splits the page up *keeping the
| executability of the extra 511 pages*, and then marks the new PTE page
| as reserved so that it won't be reverted.
| 
| So, basically, without the changes to init.c, the NX bits for data in
| the first two big pages won't get fixed until someone calls
| change_page_attr() on them.  If NX is disabled, these patches have no
| functional effect at all.
| 
| How does this look?

Look?  It has lots of bad line breaks and other style issues.

But I'll let Andi comment on the technical issues.

| -----
| 
| diff -purN linux-2.6.12grep/arch/i386/mm/init.c
| linux-2.6.12/arch/i386/mm/init.c
| --- linux-2.6.12grep/arch/i386/mm/init.c	2005-07-01
| 15:09:27.000000000 -0500
| +++ linux-2.6.12/arch/i386/mm/init.c	2005-07-05 14:32:57.000000000
| -0500
| @@ -666,6 +666,28 @@ static int noinline do_test_wp_bit(void)
|  	return flag;
|  }
|  
| +/*
| + * In kernel_physical_mapping_init(), any big pages that contained
| kernel text area were
| + * set up as big executable pages.  This function should be called when
| the initmem
| + * is freed, to correctly set up the executable & non-executable pages
| in this area.
| + */
| +static void cleanup_nx_in_kerneltext(void)
| +{
| +	unsigned long from, to;
| +
| +	if (!nx_enabled) return;

return; on separate line.

| +	from = PAGE_OFFSET;
| +	to = (unsigned long)_text & PAGE_MASK;
| +	for (; from<to; from += PAGE_SIZE)
	       from < to

(i.e., use the spacebar)

| +		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); 
| +	
| +	from = ((unsigned long)_etext + PAGE_SIZE - 1) & PAGE_MASK;
| +	to = ((unsigned long)__init_end + LARGE_PAGE_SIZE) &
| LARGE_PAGE_MASK;
| +	for (; from<to; from += PAGE_SIZE)

add spaces:    from < to

| +		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); 
| +}
| +
|  void free_initmem(void)
|  {
|  	unsigned long addr;


---
~Randy


* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-07-05 20:02 Stuart_Hayes
  2005-07-05 21:27 ` randy_dunlap
  2005-07-07 14:33 ` Andi Kleen
  0 siblings, 2 replies; 25+ messages in thread
From: Stuart_Hayes @ 2005-07-05 20:02 UTC (permalink / raw)
  To: ak; +Cc: riel, andrea, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5643 bytes --]

Hayes, Stuart wrote:
>> So, if I understand correctly what's going on in x86_64, your fix
>> wouldn't be applicable to i386.  In x86_64, every large page has a
>> correct "ref_prot" that is the normal setting for that page... but in
>> i386, the kernel text area does not--it should ideally be split into
>> small pages all the time if there are both kernel code & free pages
>> residing in the same 2M area. 
>> 
>> Stuart
> 
> (This isn't a submission--I'm just posting this for comments.)
> 
> Right now, any large page that touches anywhere from PAGE_OFFSET to
> __init_end is initially set up as a large, executable page... but
> some of this area contains data & free pages.  The patch below adds a
> "cleanup_nx_in_kerneltext()" function, called at the end of
> free_initmem(), which changes these pages--except for the range from
> "_text" to "_etext"--to PAGE_KERNEL (i.e., non-executable).     
> 
> This does result in two large pages being split up into small PTEs
> permanently, but all the non-code regions will be non-executable, and
> change_page_attr() will work correctly.  
> 
> What do you think of this?  I have tested this on 2.6.12.
> 
> (I've attached the patch as a file, too, since my mail server can't
> be convinced to not wrap text.) 
> 
> Stuart
> 

Andi--

I made another pass at this.  This does roughly the same thing, but it
doesn't create the new "change_page_attr_perm()" functions.  With this
patch, the change to init.c (cleanup_nx_in_kerneltext()) is optional.  I
changed __change_page_attr() so that, if the page to be changed is part
of a large executable page, it splits the page up *keeping the
executability of the extra 511 pages*, and then marks the new PTE page
as reserved so that it won't be reverted.

So, basically, without the changes to init.c, the NX bits for data in
the first two big pages won't get fixed until someone calls
change_page_attr() on them.  If NX is disabled, these patches have no
functional effect at all.

How does this look?
Thanks!
Stuart

-----

diff -purN linux-2.6.12grep/arch/i386/mm/init.c
linux-2.6.12/arch/i386/mm/init.c
--- linux-2.6.12grep/arch/i386/mm/init.c	2005-07-01
15:09:27.000000000 -0500
+++ linux-2.6.12/arch/i386/mm/init.c	2005-07-05 14:32:57.000000000
-0500
@@ -666,6 +666,28 @@ static int noinline do_test_wp_bit(void)
 	return flag;
 }
 
+/*
+ * In kernel_physical_mapping_init(), any big pages that contained
kernel text area were
+ * set up as big executable pages.  This function should be called when
the initmem
+ * is freed, to correctly set up the executable & non-executable pages
in this area.
+ */
+static void cleanup_nx_in_kerneltext(void)
+{
+	unsigned long from, to;
+
+	if (!nx_enabled) return;
+
+	from = PAGE_OFFSET;
+	to = (unsigned long)_text & PAGE_MASK;
+	for (; from<to; from += PAGE_SIZE)
+		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); 
+	
+	from = ((unsigned long)_etext + PAGE_SIZE - 1) & PAGE_MASK;
+	to = ((unsigned long)__init_end + LARGE_PAGE_SIZE) &
LARGE_PAGE_MASK;
+	for (; from<to; from += PAGE_SIZE)
+		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); 
+}
+
 void free_initmem(void)
 {
 	unsigned long addr;
@@ -679,6 +701,8 @@ void free_initmem(void)
 		totalram_pages++;
 	}
 	printk (KERN_INFO "Freeing unused kernel memory: %dk freed\n",
(__init_end - __init_begin) >> 10);
+
+	cleanup_nx_in_kerneltext();
 }
 
 #ifdef CONFIG_BLK_DEV_INITRD
diff -purN linux-2.6.12grep/arch/i386/mm/pageattr.c
linux-2.6.12/arch/i386/mm/pageattr.c
--- linux-2.6.12grep/arch/i386/mm/pageattr.c	2005-07-01
15:09:08.000000000 -0500
+++ linux-2.6.12/arch/i386/mm/pageattr.c	2005-07-05
14:44:53.000000000 -0500
@@ -35,7 +35,8 @@ pte_t *lookup_address(unsigned long addr
         return pte_offset_kernel(pmd, address);
 } 
 
-static struct page *split_large_page(unsigned long address, pgprot_t
prot)
+static struct page *split_large_page(unsigned long address, pgprot_t
prot, 
+					pgprot_t ref_prot)
 { 
 	int i; 
 	unsigned long addr;
@@ -53,7 +54,7 @@ static struct page *split_large_page(uns
 	pbase = (pte_t *)page_address(base);
 	for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
 		pbase[i] = pfn_pte(addr >> PAGE_SHIFT, 
-				   addr == address ? prot :
PAGE_KERNEL);
+				   addr == address ? prot : ref_prot);
 	}
 	return base;
 } 
@@ -118,11 +119,30 @@ __change_page_attr(struct page *page, pg
 	if (!kpte)
 		return -EINVAL;
 	kpte_page = virt_to_page(kpte);
+
+	/*
+	 * If this page is part of a large page that's executable (and
NX is
+	 * enabled), then split page up and set new PTE page as reserved
so
+	 * we won't revert this back into a large page.  This should
only
+	 * happen in large pages that also contain kernel executable
code,
+	 * and shouldn't happen at all if init.c correctly sets up NX
regions.
+	 */
+	if (nx_enabled && 
+	    !(pte_val(*kpte) & _PAGE_NX) &&
+	    (pte_val(*kpte) & _PAGE_PSE)) {
+		struct page *split = split_large_page(address, prot,
PAGE_KERNEL_EXEC); 
+		if (!split)
+			return -ENOMEM;
+		set_pmd_pte(kpte,address,mk_pte(split,
PAGE_KERNEL_EXEC));
+		SetPageReserved(split);
+		return 0;
+	}
+
 	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) { 
 		if ((pte_val(*kpte) & _PAGE_PSE) == 0) { 
 			set_pte_atomic(kpte, mk_pte(page, prot)); 
 		} else {
-			struct page *split = split_large_page(address,
prot); 
+			struct page *split = split_large_page(address,
prot, PAGE_KERNEL); 
 			if (!split)
 				return -ENOMEM;
 			set_pmd_pte(kpte,address,mk_pte(split,
PAGE_KERNEL));

[-- Attachment #2: fixnx2.patch --]
[-- Type: application/octet-stream, Size: 3550 bytes --]

diff -purN --exclude='*orig' --exclude='*o' --exclude='*cmd' linux-2.6.12grep/arch/i386/mm/init.c linux-2.6.12/arch/i386/mm/init.c
--- linux-2.6.12grep/arch/i386/mm/init.c	2005-07-01 15:09:27.000000000 -0500
+++ linux-2.6.12/arch/i386/mm/init.c	2005-07-05 14:32:57.000000000 -0500
@@ -666,6 +666,28 @@ static int noinline do_test_wp_bit(void)
 	return flag;
 }
 
+/*
+ * In kernel_physical_mapping_init(), any big pages that contained kernel text area were
+ * set up as big executable pages.  This function should be called when the initmem
+ * is freed, to correctly set up the executable & non-executable pages in this area.
+ */
+static void cleanup_nx_in_kerneltext(void)
+{
+	unsigned long from, to;
+
+	if (!nx_enabled) return;
+
+	from = PAGE_OFFSET;
+	to = (unsigned long)_text & PAGE_MASK;
+	for (; from<to; from += PAGE_SIZE)
+		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); 
+	
+	from = ((unsigned long)_etext + PAGE_SIZE - 1) & PAGE_MASK;
+	to = ((unsigned long)__init_end + LARGE_PAGE_SIZE) & LARGE_PAGE_MASK;
+	for (; from<to; from += PAGE_SIZE)
+		change_page_attr(virt_to_page(from), 1, PAGE_KERNEL); 
+}
+
 void free_initmem(void)
 {
 	unsigned long addr;
@@ -679,6 +701,8 @@ void free_initmem(void)
 		totalram_pages++;
 	}
 	printk (KERN_INFO "Freeing unused kernel memory: %dk freed\n", (__init_end - __init_begin) >> 10);
+
+	cleanup_nx_in_kerneltext();
 }
 
 #ifdef CONFIG_BLK_DEV_INITRD
diff -purN --exclude='*orig' --exclude='*o' --exclude='*cmd' linux-2.6.12grep/arch/i386/mm/pageattr.c linux-2.6.12/arch/i386/mm/pageattr.c
--- linux-2.6.12grep/arch/i386/mm/pageattr.c	2005-07-01 15:09:08.000000000 -0500
+++ linux-2.6.12/arch/i386/mm/pageattr.c	2005-07-05 14:44:53.000000000 -0500
@@ -35,7 +35,8 @@ pte_t *lookup_address(unsigned long addr
         return pte_offset_kernel(pmd, address);
 } 
 
-static struct page *split_large_page(unsigned long address, pgprot_t prot)
+static struct page *split_large_page(unsigned long address, pgprot_t prot, 
+					pgprot_t ref_prot)
 { 
 	int i; 
 	unsigned long addr;
@@ -53,7 +54,7 @@ static struct page *split_large_page(uns
 	pbase = (pte_t *)page_address(base);
 	for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
 		pbase[i] = pfn_pte(addr >> PAGE_SHIFT, 
-				   addr == address ? prot : PAGE_KERNEL);
+				   addr == address ? prot : ref_prot);
 	}
 	return base;
 } 
@@ -118,11 +119,30 @@ __change_page_attr(struct page *page, pg
 	if (!kpte)
 		return -EINVAL;
 	kpte_page = virt_to_page(kpte);
+
+	/*
+	 * If this page is part of a large page that's executable (and NX is
+	 * enabled), then split page up and set new PTE page as reserved so
+	 * we won't revert this back into a large page.  This should only
+	 * happen in large pages that also contain kernel executable code,
+	 * and shouldn't happen at all if init.c correctly sets up NX regions.
+	 */
+	if (nx_enabled && 
+	    !(pte_val(*kpte) & _PAGE_NX) &&
+	    (pte_val(*kpte) & _PAGE_PSE)) {
+		struct page *split = split_large_page(address, prot, PAGE_KERNEL_EXEC); 
+		if (!split)
+			return -ENOMEM;
+		set_pmd_pte(kpte,address,mk_pte(split, PAGE_KERNEL_EXEC));
+		SetPageReserved(split);
+		return 0;
+	}
+
 	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) { 
 		if ((pte_val(*kpte) & _PAGE_PSE) == 0) { 
 			set_pte_atomic(kpte, mk_pte(page, prot)); 
 		} else {
-			struct page *split = split_large_page(address, prot); 
+			struct page *split = split_large_page(address, prot, PAGE_KERNEL); 
 			if (!split)
 				return -ENOMEM;
 			set_pmd_pte(kpte,address,mk_pte(split, PAGE_KERNEL));


* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-07-01 20:33 Stuart_Hayes
  0 siblings, 0 replies; 25+ messages in thread
From: Stuart_Hayes @ 2005-07-01 20:33 UTC (permalink / raw)
  To: ak; +Cc: riel, andrea, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5598 bytes --]

> 
> So, if I understand correctly what's going on in x86_64, your fix
> wouldn't be applicable to i386.  In x86_64, every large page has a
> correct "ref_prot" that is the normal setting for that page... but in
> i386, the kernel text area does not--it should ideally be split into
> small pages all the time if there are both kernel code & free pages
> residing in the same 2M area.     
> 
> Stuart

(This isn't a submission--I'm just posting this for comments.)

Right now, any large page that touches anywhere from PAGE_OFFSET to
__init_end is initially set up as a large, executable page... but some
of this area contains data & free pages.  The patch below adds a
"cleanup_nx_in_kerneltext()" function, called at the end of
free_initmem(), which changes these pages--except for the range from
"_text" to "_etext"--to PAGE_KERNEL (i.e., non-executable).

This does result in two large pages being split up into small PTEs
permanently, but all the non-code regions will be non-executable, and
change_page_attr() will work correctly.

What do you think of this?  I have tested this on 2.6.12.

(I've attached the patch as a file, too, since my mail server can't be
convinced to not wrap text.)

Stuart

-----


diff -purN --exclude='*.o' --exclude='*.cmd' linux-2.6.12grep/arch/i386/mm/init.c linux-2.6.12/arch/i386/mm/init.c
--- linux-2.6.12grep/arch/i386/mm/init.c	2005-07-01 15:09:27.000000000 -0500
+++ linux-2.6.12/arch/i386/mm/init.c	2005-07-01 15:13:06.000000000 -0500
@@ -666,6 +666,30 @@ static int noinline do_test_wp_bit(void)
 	return flag;
 }
 
+extern int change_page_attr_perm(struct page *, int, pgprot_t);
+
+/*
+ * In kernel_physical_mapping_init(), any big pages that contained kernel text area were
+ * set up as big executable pages.  This function should be called when the initmem
+ * is freed, to correctly set up the executable & non-executable pages in this area.
+ */
+static void cleanup_nx_in_kerneltext(void)
+{
+	unsigned long from, to;
+
+	if (!nx_enabled) return;
+
+	from = PAGE_OFFSET;
+	to = (unsigned long)_text & PAGE_MASK;
+	for (; from<to; from += PAGE_SIZE)
+		change_page_attr_perm(virt_to_page(from), 1, PAGE_KERNEL); 
+	
+	from = ((unsigned long)_etext + PAGE_SIZE - 1) & PAGE_MASK;
+	to = ((unsigned long)__init_end + LARGE_PAGE_SIZE) & LARGE_PAGE_MASK;
+	for (; from<to; from += PAGE_SIZE)
+		change_page_attr_perm(virt_to_page(from), 1, PAGE_KERNEL); 
+}
+
 void free_initmem(void)
 {
 	unsigned long addr;
@@ -679,6 +703,8 @@ void free_initmem(void)
 		totalram_pages++;
 	}
 	printk (KERN_INFO "Freeing unused kernel memory: %dk freed\n", (__init_end - __init_begin) >> 10);
+
+	cleanup_nx_in_kerneltext();
 }
 
 #ifdef CONFIG_BLK_DEV_INITRD
diff -purN --exclude='*.o' --exclude='*.cmd' linux-2.6.12grep/arch/i386/mm/pageattr.c linux-2.6.12/arch/i386/mm/pageattr.c
--- linux-2.6.12grep/arch/i386/mm/pageattr.c	2005-07-01 15:09:08.000000000 -0500
+++ linux-2.6.12/arch/i386/mm/pageattr.c	2005-07-01 14:56:06.000000000 -0500
@@ -35,7 +35,7 @@ pte_t *lookup_address(unsigned long addr
         return pte_offset_kernel(pmd, address);
 } 
 
-static struct page *split_large_page(unsigned long address, pgprot_t prot)
+static struct page *split_large_page(unsigned long address, pgprot_t prot, pgprot_t ref_prot)
 { 
 	int i; 
 	unsigned long addr;
@@ -53,7 +53,7 @@ static struct page *split_large_page(uns
 	pbase = (pte_t *)page_address(base);
 	for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
 		pbase[i] = pfn_pte(addr >> PAGE_SHIFT, 
-				   addr == address ? prot : PAGE_KERNEL);
+				   addr == address ? prot : ref_prot);
 	}
 	return base;
 } 
@@ -122,7 +122,7 @@ __change_page_attr(struct page *page, pg
 		if ((pte_val(*kpte) & _PAGE_PSE) == 0) { 
 			set_pte_atomic(kpte, mk_pte(page, prot)); 
 		} else {
-			struct page *split = split_large_page(address, prot); 
+			struct page *split = split_large_page(address, prot, PAGE_KERNEL); 
 			if (!split)
 				return -ENOMEM;
 			set_pmd_pte(kpte,address,mk_pte(split, PAGE_KERNEL));
@@ -152,6 +152,38 @@ __change_page_attr(struct page *page, pg
 	return 0;
 } 
 
+static int __change_page_attr_perm (struct page *page, pgprot_t prot)
+{ 
+	pte_t *kpte; 
+	unsigned long address;
+	struct page *kpte_page;
+
+	BUG_ON(PageHighMem(page));
+	address = (unsigned long)page_address(page);
+
+	kpte = lookup_address(address);
+	if (!kpte)
+		return -EINVAL;
+	kpte_page = virt_to_page(kpte);
+
+	if ((pte_val(*kpte) & _PAGE_PSE) == 0) { 
+		set_pte_atomic(kpte, mk_pte(page, prot)); 
+	} else {
+		pgprot_t ref_prot;
+
+		if ((pte_val(*kpte) & _PAGE_NX))
+			ref_prot = PAGE_KERNEL;
+		else
+			ref_prot = PAGE_KERNEL_EXEC;
+		kpte_page = split_large_page(address, prot, ref_prot);
+		if (!kpte_page)
+			return -ENOMEM;
+		set_pmd_pte(kpte,address,mk_pte(kpte_page, ref_prot));
+	}	
+	SetPageReserved(kpte_page);
+	return 0;
+} 
+
 static inline void flush_map(void)
 {
 	on_each_cpu(flush_kernel_map, NULL, 1, 1);
@@ -186,6 +218,22 @@ int change_page_attr(struct page *page, 
 	return err;
 }
 
+int change_page_attr_perm(struct page *page, int numpages, pgprot_t prot)
+{
+	int err = 0; 
+	int i; 
+	unsigned long flags;
+
+	spin_lock_irqsave(&cpa_lock, flags);
+	for (i = 0; i < numpages; i++, page++) { 
+		err = __change_page_attr_perm(page, prot);
+		if (err) 
+			break; 
+	} 	
+	spin_unlock_irqrestore(&cpa_lock, flags);
+	return err;
+}
+
 void global_flush_tlb(void)
 { 
 	LIST_HEAD(l);

[-- Attachment #2: pass1.patch --]
[-- Type: application/octet-stream, Size: 4194 bytes --]

diff -purN --exclude='*.o' --exclude='*.cmd' linux-2.6.12grep/arch/i386/mm/init.c linux-2.6.12/arch/i386/mm/init.c
--- linux-2.6.12grep/arch/i386/mm/init.c	2005-07-01 15:09:27.000000000 -0500
+++ linux-2.6.12/arch/i386/mm/init.c	2005-07-01 15:13:06.000000000 -0500
@@ -666,6 +666,30 @@ static int noinline do_test_wp_bit(void)
 	return flag;
 }
 
+extern int change_page_attr_perm(struct page *, int, pgprot_t);
+
+/*
+ * In kernel_physical_mapping_init(), any big pages that contained kernel text area were
+ * set up as big executable pages.  This function should be called when the initmem
+ * is freed, to correctly set up the executable & non-executable pages in this area.
+ */
+static void cleanup_nx_in_kerneltext(void)
+{
+	unsigned long from, to;
+
+	if (!nx_enabled) return;
+
+	from = PAGE_OFFSET;
+	to = (unsigned long)_text & PAGE_MASK;
+	for (; from<to; from += PAGE_SIZE)
+		change_page_attr_perm(virt_to_page(from), 1, PAGE_KERNEL); 
+	
+	from = ((unsigned long)_etext + PAGE_SIZE - 1) & PAGE_MASK;
+	to = ((unsigned long)__init_end + LARGE_PAGE_SIZE) & LARGE_PAGE_MASK;
+	for (; from<to; from += PAGE_SIZE)
+		change_page_attr_perm(virt_to_page(from), 1, PAGE_KERNEL); 
+}
+
 void free_initmem(void)
 {
 	unsigned long addr;
@@ -679,6 +703,8 @@ void free_initmem(void)
 		totalram_pages++;
 	}
 	printk (KERN_INFO "Freeing unused kernel memory: %dk freed\n", (__init_end - __init_begin) >> 10);
+
+	cleanup_nx_in_kerneltext();
 }
 
 #ifdef CONFIG_BLK_DEV_INITRD
diff -purN --exclude='*.o' --exclude='*.cmd' linux-2.6.12grep/arch/i386/mm/pageattr.c linux-2.6.12/arch/i386/mm/pageattr.c
--- linux-2.6.12grep/arch/i386/mm/pageattr.c	2005-07-01 15:09:08.000000000 -0500
+++ linux-2.6.12/arch/i386/mm/pageattr.c	2005-07-01 14:56:06.000000000 -0500
@@ -35,7 +35,7 @@ pte_t *lookup_address(unsigned long addr
         return pte_offset_kernel(pmd, address);
 } 
 
-static struct page *split_large_page(unsigned long address, pgprot_t prot)
+static struct page *split_large_page(unsigned long address, pgprot_t prot, pgprot_t ref_prot)
 { 
 	int i; 
 	unsigned long addr;
@@ -53,7 +53,7 @@ static struct page *split_large_page(uns
 	pbase = (pte_t *)page_address(base);
 	for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
 		pbase[i] = pfn_pte(addr >> PAGE_SHIFT, 
-				   addr == address ? prot : PAGE_KERNEL);
+				   addr == address ? prot : ref_prot);
 	}
 	return base;
 } 
@@ -122,7 +122,7 @@ __change_page_attr(struct page *page, pg
 		if ((pte_val(*kpte) & _PAGE_PSE) == 0) { 
 			set_pte_atomic(kpte, mk_pte(page, prot)); 
 		} else {
-			struct page *split = split_large_page(address, prot); 
+			struct page *split = split_large_page(address, prot, PAGE_KERNEL); 
 			if (!split)
 				return -ENOMEM;
 			set_pmd_pte(kpte,address,mk_pte(split, PAGE_KERNEL));
@@ -152,6 +152,38 @@ __change_page_attr(struct page *page, pg
 	return 0;
 } 
 
+static int __change_page_attr_perm (struct page *page, pgprot_t prot)
+{ 
+	pte_t *kpte; 
+	unsigned long address;
+	struct page *kpte_page;
+
+	BUG_ON(PageHighMem(page));
+	address = (unsigned long)page_address(page);
+
+	kpte = lookup_address(address);
+	if (!kpte)
+		return -EINVAL;
+	kpte_page = virt_to_page(kpte);
+
+	if ((pte_val(*kpte) & _PAGE_PSE) == 0) { 
+		set_pte_atomic(kpte, mk_pte(page, prot)); 
+	} else {
+		pgprot_t ref_prot;
+
+		if ((pte_val(*kpte) & _PAGE_NX))
+			ref_prot = PAGE_KERNEL;
+		else
+			ref_prot = PAGE_KERNEL_EXEC;
+		kpte_page = split_large_page(address, prot, ref_prot);
+		if (!kpte_page)
+			return -ENOMEM;
+		set_pmd_pte(kpte,address,mk_pte(kpte_page, ref_prot));
+	}	
+	SetPageReserved(kpte_page);
+	return 0;
+} 
+
 static inline void flush_map(void)
 {
 	on_each_cpu(flush_kernel_map, NULL, 1, 1);
@@ -186,6 +218,22 @@ int change_page_attr(struct page *page, 
 	return err;
 }
 
+int change_page_attr_perm(struct page *page, int numpages, pgprot_t prot)
+{
+	int err = 0; 
+	int i; 
+	unsigned long flags;
+
+	spin_lock_irqsave(&cpa_lock, flags);
+	for (i = 0; i < numpages; i++, page++) { 
+		err = __change_page_attr_perm(page, prot);
+		if (err) 
+			break; 
+	} 	
+	spin_unlock_irqrestore(&cpa_lock, flags);
+	return err;
+}
+
 void global_flush_tlb(void)
 { 
 	LIST_HEAD(l);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-06-30 19:11 Stuart_Hayes
  0 siblings, 0 replies; 25+ messages in thread
From: Stuart_Hayes @ 2005-06-30 19:11 UTC (permalink / raw)
  To: arjan; +Cc: ak, riel, andrea, linux-kernel

Arjan van de Ven wrote:
> On Thu, 2005-06-30 at 11:56 -0500, Stuart_Hayes@Dell.com wrote:
>> Andi Kleen wrote:
>>> 
>>> I only fixed it for x86-64 correct. Does it work for you on x86-64?
>>> 
>>> If yes then the changes could be brought over.
>>> 
>>> What do you all need this for anyways?
>>> 
>>> -Andi
>> 
>> We need this because the NVidia driver uses change_page_attr() to
>> make pages non-cachable, which is causing systems to spontaneously
>> reboot when it gets a page that's in the first 8MB of memory.
>> 
> 
> that's not a linux problem since there isn't really a linux driver
> that does this ;) 

That's why I didn't mention what failed in the first place!  :P

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: page allocation/attributes question (i386/x86_64 specific)
  2005-06-30 16:56 Stuart_Hayes
@ 2005-06-30 17:34 ` Arjan van de Ven
  0 siblings, 0 replies; 25+ messages in thread
From: Arjan van de Ven @ 2005-06-30 17:34 UTC (permalink / raw)
  To: Stuart_Hayes; +Cc: ak, riel, andrea, linux-kernel

On Thu, 2005-06-30 at 11:56 -0500, Stuart_Hayes@Dell.com wrote:
> Andi Kleen wrote:
> > 
> > I only fixed it for x86-64 correct. Does it work for you on x86-64?
> > 
> > If yes then the changes could be brought over.
> > 
> > What do you all need this for anyways?
> > 
> > -Andi
> 
> We need this because the NVidia driver uses change_page_attr() to make
> pages non-cachable, which is causing systems to spontaneously reboot
> when it gets a page that's in the first 8MB of memory.
> 

that's not a linux problem since there isn't really a linux driver that
does this ;)



^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-06-30 17:14 Stuart_Hayes
  0 siblings, 0 replies; 25+ messages in thread
From: Stuart_Hayes @ 2005-06-30 17:14 UTC (permalink / raw)
  To: Stuart_Hayes, ak; +Cc: riel, andrea, linux-kernel

Hayes, Stuart wrote:
> Andi Kleen wrote:
>> 
>> I only fixed it for x86-64 correct. Does it work for you on x86-64?
>> 
>> If yes then the changes could be brought over.
>> 
>> What do you all need this for anyways?
>> 
>> -Andi
> 
> We need this because the NVidia driver uses change_page_attr() to
> make pages non-cachable, which is causing systems to spontaneously
> reboot when it gets a page that's in the first 8MB of memory.  
> 
> I'll look at x86_64.  The problem was seen originally with i386, and
> I haven't taken much time to look at the x86_64 stuff yet. 
> 
> Thanks,
> Stuart

Doesn't x86_64 map the kernel code into a different virtual address
range than the direct mapping of physical memory--and __get_free_pages()
returns a pointer to the direct mapping area rather than to the kernel
text area?  This would prevent the problem, because I assume the entire
direct mapping of physical memory area is set to non-executable.

The problem with i386 occurs because the kernel code and kernel memory
are all mapped into the same virtual address range (at 0xC0000000), so
kernel code and free pages returned by __get_free_pages() can both reside
in the same large page.

So, if I understand correctly what's going on in x86_64, your fix
wouldn't be applicable to i386.  In x86_64, every large page has a
correct "ref_prot" that is the normal setting for that page... but in
i386, the kernel text area does not--it should ideally be split into
small pages all the time if there are both kernel code & free pages
residing in the same 2M area.
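
For reference, the base addresses involved look roughly like this.  The
macro names below are made up purely for illustration, and the values are
what the 2.6.12-era headers use (quoted from memory, so treat them as
approximate):

/* x86_64: RAM direct mapping and kernel text live in separate ranges */
#define X86_64_PAGE_OFFSET      0xffff810000000000UL  /* direct mapping  */
#define X86_64_KERNEL_TEXT_MAP  0xffffffff80000000UL  /* kernel text map */

/* i386: kernel text and the direct mapping share one range */
#define I386_PAGE_OFFSET        0xC0000000UL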

Stuart




^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-06-30 16:56 Stuart_Hayes
  2005-06-30 17:34 ` Arjan van de Ven
  0 siblings, 1 reply; 25+ messages in thread
From: Stuart_Hayes @ 2005-06-30 16:56 UTC (permalink / raw)
  To: ak; +Cc: riel, andrea, linux-kernel

Andi Kleen wrote:
> 
> I only fixed it for x86-64 correct. Does it work for you on x86-64?
> 
> If yes then the changes could be brought over.
> 
> What do you all need this for anyways?
> 
> -Andi

We need this because the NVidia driver uses change_page_attr() to make
pages non-cachable, which is causing systems to spontaneously reboot
when it gets a page that's in the first 8MB of memory.

I'll look at x86_64.  The problem was seen originally with i386, and I
haven't taken much time to look at the x86_64 stuff yet.

Thanks,
Stuart


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: page allocation/attributes question (i386/x86_64 specific)
  2005-06-28 20:16 Stuart_Hayes
@ 2005-06-30 16:15 ` Andi Kleen
  0 siblings, 0 replies; 25+ messages in thread
From: Andi Kleen @ 2005-06-30 16:15 UTC (permalink / raw)
  To: Stuart_Hayes; +Cc: ak, riel, andrea, linux-kernel

On Tue, Jun 28, 2005 at 03:16:51PM -0500, Stuart_Hayes@Dell.com wrote:
> >> set and isn't executable.  And, since PAGE_KERNEL (with _PAGE_NX set)
> >> didn't match the original pages attributes, the 512 PTEs weren't
> >> reverted back into a large page.  (Also, __change_page_attr() did
> >> *another* get_page() on the page containing these 512 PTEs, so now
> >> the page_count has gone up to 3, instead of going back down to 1 (or
> >> staying at 2).)
> > 
> > That should be already fixed.
> 
> It doesn't appear to be fixed (in the i386 arch).  The

I only fixed it for x86-64 correct. Does it work for you on x86-64?

If yes then the changes could be brought over.

What do you all need this for anyways?

-Andi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-06-29 17:20 Stuart_Hayes
  0 siblings, 0 replies; 25+ messages in thread
From: Stuart_Hayes @ 2005-06-29 17:20 UTC (permalink / raw)
  To: Stuart_Hayes, ak; +Cc: riel, andrea, linux-kernel


>> 
>> That should be already fixed.
> 
> It doesn't appear to be fixed (in the i386 arch).  The
> change_page_attr()/split_large_page() code will still set all
> the 4K PTEs to PAGE_KERNEL (setting the _PAGE_NX bit) when a large
> page needs to be split.   
> 
> This wouldn't be a problem for the bulk of the kernel memory, but
> there are pages in the lower 4MB of memory that are free and are part
> of large executable pages that also contain kernel code.  If
> change_page_attr() is called on these, it will set the _PAGE_NX bit
> on the whole 2MB region that was covered by the large page, causing a
> large chunk of kernel code to be non-executable.     
> 
> I can only think of a few ways to fix this:
> 
> (1) Make sure that any free pages that share a large-page-aligned
> area with kernel code are not executable.  This would cause a big
> part of the kernel code to be in small pages, which might have an
> effect on performance.   
> 
> (2) Make sure that there aren't any free pages in the
> large-page-aligned areas that contain kernel executable code, by
> reserving them somehow.  This seems kind of wasteful of
> memory--ZONE_DMA memory, at that.   
> 
> or (3) Allow the free pages in the 2MB large pages that also contain
> kernel code to be executable, but somehow fix change_page_attr() so
> it doesn't set the _PAGE_NX bit on the whole 2MB large page region
> when called.  This would require one of the changes that I suggested
> earlier, like ignoring the NX bit in the "prot" that's passed to
> change_page_attr(), or just accepting the fact that a large executable
> page split by change_page_attr() will not be reverted back to a large
> page if the calling function uses change_page_attr(page,PAGE_KERNEL)
> when it is done.  Or changing the
> interface, so that there's a change_page_attr() and an
> unchange_page_attr() that automatically reverts the page back to
> whatever the previous attributes were.           
> 
> Am I missing something?
> 

(Fixed Andrea's email address)

This is definitely broken in 2.6.12.  I wrote a test module that does
__get_free_pages(GFP_DMA,0) until it sees that the page it gets is
executable (it got about 30 pages before it got one that's executable),
and then it does a change_page_attr(page,PAGE_KERNEL_NOCACHE) on the
page.  Shortly thereafter the system reboots--presumably it got a triple
fault because the exception handler was no longer executable, or
something like that.
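
A rough reconstruction of that test module is below (a sketch only, not
the module actually run; it assumes a 2.6.12-era i386 kernel with PAE/NX,
and that lookup_address() is visible to module code, which may require
exporting it first):

/*
 * Sketch of the described test: allocate low pages until one is
 * executable, then change it to PAGE_KERNEL_NOCACHE and see what
 * happens to the rest of the large page.
 */
#include <linux/module.h>
#include <linux/mm.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>	/* change_page_attr(), global_flush_tlb() */

#define MAX_TRIES 64

static unsigned long pages[MAX_TRIES];

static int __init nx_cpa_test_init(void)
{
	int i;

	for (i = 0; i < MAX_TRIES; i++) {
		pte_t *pte;

		pages[i] = __get_free_pages(GFP_DMA, 0);
		if (!pages[i])
			break;

		pte = lookup_address(pages[i]);
		if (pte && !(pte_val(*pte) & _PAGE_NX)) {
			/* Executable low page that shares a large page with
			 * kernel text: make it uncached and see whether the
			 * rest of the 2MB region loses its executable bit. */
			change_page_attr(virt_to_page(pages[i]), 1,
					 PAGE_KERNEL_NOCACHE);
			global_flush_tlb();
			break;
		}
	}
	return 0;
}

static void __exit nx_cpa_test_exit(void)
{
	int i;

	for (i = 0; i < MAX_TRIES && pages[i]; i++)
		free_pages(pages[i], 0);
}

module_init(nx_cpa_test_init);
module_exit(nx_cpa_test_exit);
MODULE_LICENSE("GPL");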

My personal opinion of the best way to fix this would be option 3
above--just rewrite change_page_attr() so that the 511 other little
pages that are created inherit the _PAGE_NX bit from the large page that
was split--but make sure that it doesn't revert these PTEs back into a
large page if change_page_attr(page,PAGE_KERNEL) is called, since the
other pages in this 2MB area aren't PAGE_KERNEL.

Option 1 seems more correct to me, but not necessarily worth the
performance cost.

What do you think?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: page allocation/attributes question (i386/x86_64 specific)
@ 2005-06-28 20:16 Stuart_Hayes
  2005-06-30 16:15 ` Andi Kleen
  0 siblings, 1 reply; 25+ messages in thread
From: Stuart_Hayes @ 2005-06-28 20:16 UTC (permalink / raw)
  To: ak; +Cc: riel, andrea, linux-kernel

> On Fri, Jun 24, 2005 at 01:20:04PM -0500, Stuart_Hayes@Dell.com wrote:
>> Rik Van Riel wrote:
>>> On Wed, 22 Jun 2005 Stuart_Hayes@Dell.com wrote:
>>> 
>>>> My question is this:  when split_large_page() is called, should it
>>>> make the other 511 PTEs inherit the page attributes from the large
>>>> page (with the exception of PAGE_PSE, obviously)?
>>> 
>>> I suspect it should.
>> 
>> (Copying Andi Kleen & Andrea Arcangeli since they were involved in a
>> previous discussion of this.  I'm trying to fix the NX handling when
>> change_page_attr() is called in i386.)
>> 
>> After looking into it further, I agree completely.  I found a thread
>> in which this was discussed
>> (http://marc.theaimsgroup.com/?l=linux-kernel&m=109964359124731&w=2),
>> and discovered that this has been addressed in the X86_64 kernel. 
> 
> You don't say which kernel you were working against. 2.6.11 has some
> fixes in this area (in particular with the reference counts) 

Sorry for the delayed response.

I'm looking at 2.6.12 now.  I was looking at an older kernel, which set
up the bulk of the kernel memory as executable.  That appears to be
fixed in 2.6.12, but the problem I originally brought up still remains
(see below).

> 
>> However, when I used "change_page_attr()" to change the page to
>> PAGE_KERNEL, it did just that.  But PAGE_KERNEL has the _PAGE_NX bit
>> set and isn't executable.  And, since PAGE_KERNEL (with _PAGE_NX set)
>> didn't match the original pages attributes, the 512 PTEs weren't
>> reverted back into a large page.  (Also, __change_page_attr() did
>> *another* get_page() on the page containing these 512 PTEs, so now
>> the page_count has gone up to 3, instead of going back down to 1 (or
>> staying at 2).)
> 
> That should be already fixed.

It doesn't appear to be fixed (in the i386 arch).  The
change_page_attr()/split_large_page() code will still set all the
4K PTEs to PAGE_KERNEL (setting the _PAGE_NX bit) when a large page
needs to be split.

This wouldn't be a problem for the bulk of the kernel memory, but there
are pages in the lower 4MB of memory that are free and are part of large
executable pages that also contain kernel code.  If change_page_attr()
is called on these, it will set the _PAGE_NX bit on the whole 2MB region
that was covered by the large page, causing a large chunk of kernel code
to be non-executable.

I can only think of a few ways to fix this:

(1) Make sure that any free pages that share a large-page-aligned area
with kernel code are not executable.  This would cause a big part of the
kernel code to be in small pages, which might have an effect on
performance.

(2) Make sure that there aren't any free pages in the large-page-aligned
areas that contain kernel executable code, by reserving them somehow.
This seems kind of wasteful of memory--ZONE_DMA memory, at that.

or (3) Allow the free pages in the 2MB large pages that also contain
kernel code to be executable, but somehow fix change_page_attr() so it
doesn't set the _PAGE_NX bit on the whole 2MB large page region when
called.  This would require one of the changes that I suggested earlier,
like ignoring the NX bit in the "prot" that's passed to
change_page_attr(), or just accepting the fact that a large executable
page split by change_page_attr() will not be reverted back to a large
page if the calling function uses change_page_attr(page,PAGE_KERNEL)
when it is done.  Or changing the
interface, so that there's a change_page_attr() and an
unchange_page_attr() that automatically reverts the page back to
whatever the previous attributes were.
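
To make that last idea concrete, the paired interface could look something
like the following.  This is purely a hypothetical shape for discussion;
nothing like it exists in the kernel today:

/* Hypothetical interface sketch only -- not existing or proposed code. */
struct page_attr_save {
	pgprot_t old_prot;		/* attributes before the change */
};

/* Like change_page_attr(), but records what the pages looked like. */
int change_page_attr_save(struct page *page, int numpages, pgprot_t prot,
			  struct page_attr_save *save);

/* Restores exactly what change_page_attr_save() recorded, so callers
 * never have to guess that PAGE_KERNEL is the right "undo" value. */
int unchange_page_attr(struct page *page, int numpages,
		       struct page_attr_save *save);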

Am I missing something?

>> 
>> Is the function that calls "change_page_attr()" expected to look at
>> the attributes of the page before calling change_page_attr(), so it
>> knows how to un-change the attributes when it is finished with the
>> page (since PAGE_KERNEL isn't always what the page was originally)?
> 
> No, it's not supposed to do such checks on its own.
> 
>> 
>> Or should "change_page_attr()" ignore the NX bit in the "pgprot_t
>> prot" parameter that's passed to it, and just inherit NX from the
>> large/existing page?  Then change_page_attr(page,PAGE_KERNEL) could
>> be used to undo changes, but change_page_attr() couldn't be used to
>> modify the NX bit.
> 
> I don't think that makes sense. It should exactly set the protection
> the caller requested and revert when the protection is exactly like
> it was before.  
> 
> -Andi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: page allocation/attributes question (i386/x86_64 specific)
  2005-06-22 17:11 Stuart_Hayes
@ 2005-06-22 23:54 ` Rik Van Riel
  0 siblings, 0 replies; 25+ messages in thread
From: Rik Van Riel @ 2005-06-22 23:54 UTC (permalink / raw)
  To: Stuart_Hayes; +Cc: linux-kernel

On Wed, 22 Jun 2005 Stuart_Hayes@Dell.com wrote:

> My question is this:  when split_large_page() is called, should it make
> the other 511 PTEs inherit the page attributes from the large page (with
> the exception of PAGE_PSE, obviously)?

I suspect it should.

-- 
The Theory of Escalating Commitment: "The cost of continuing mistakes is
borne by others, while the cost of admitting mistakes is borne by yourself."
  -- Joseph Stiglitz, Nobel Laureate in Economics

^ permalink raw reply	[flat|nested] 25+ messages in thread

* page allocation/attributes question (i386/x86_64 specific)
@ 2005-06-22 17:11 Stuart_Hayes
  2005-06-22 23:54 ` Rik Van Riel
  0 siblings, 1 reply; 25+ messages in thread
From: Stuart_Hayes @ 2005-06-22 17:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Stuart_Hayes

 
I have a question regarding page attributes on i386/x86_64 platforms:

Suppose a function requests a page with __get_free_pages(), and the page
that's returned is actually part of a large (2MB) page that has a single
pte (or pmd, I guess) to control it.

Suppose this function then calls change_page_attr() to, say, modify the
caching policy for that page.  Since the page is part of a large 2MB
page (PAGE_PSE is set), change_page_attr() calls split_large_page() to
create 512 new PTEs, since the caching policy is only being changed on
the single 4K page, and not for the entire 2MB large page.
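
The call sequence a driver would typically use looks roughly like this.
change_page_attr() and global_flush_tlb() are the real 2.6 interfaces; the
wrapper function itself is only for illustration:

#include <linux/mm.h>
#include <asm/cacheflush.h>	/* change_page_attr(), global_flush_tlb() */

/* Illustrative wrapper only; error handling trimmed. */
static void cpa_usage_sketch(void)
{
	unsigned long addr = __get_free_pages(GFP_KERNEL, 0);
	struct page *pg;

	if (!addr)
		return;
	pg = virt_to_page(addr);

	change_page_attr(pg, 1, PAGE_KERNEL_NOCACHE);	/* may split a 2MB page */
	global_flush_tlb();

	/* ... use the page uncached ... */

	change_page_attr(pg, 1, PAGE_KERNEL);		/* the intended "undo" */
	global_flush_tlb();
	free_pages(addr, 0);
}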

When split_large_page() creates the 512 new PTEs, it assigns the
requested attributes to the page in question, but it sets all of the
other 511 PTEs to the PAGE_KERNEL attributes.  It never checks what
attributes were set for the large page--it just assumes that the other
511 pages should have PAGE_KERNEL attributes.

My question is this:  when split_large_page() is called, should it make
the other 511 PTEs inherit the page attributes from the large page (with
the exception of PAGE_PSE, obviously)?

There appears to be a bug (at least in certain 2.6 kernel versions)
where __get_free_pages() returns a page that's in a large page that is
executable (it doesn't have the PAGE_NX bit set)... but, after
change_page_attr() is called, the other 511 pages, which contain kernel
code, are no longer executable (because they were set to PAGE_KERNEL,
which has PAGE_NX set).

I've copied the split_large_page() code below for reference.

Thanks,
Stuart


static struct page *split_large_page(unsigned long address, pgprot_t prot)
{ 
	int i; 
	unsigned long addr;
	struct page *base;
	pte_t *pbase;

	spin_unlock_irq(&cpa_lock);
	base = alloc_pages(GFP_KERNEL, 0);
	spin_lock_irq(&cpa_lock);
	if (!base) 
		return NULL;

	address = __pa(address);
	addr = address & LARGE_PAGE_MASK; 
	pbase = (pte_t *)page_address(base);
	for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
		pbase[i] = pfn_pte(addr >> PAGE_SHIFT, 
				   addr == address ? prot : PAGE_KERNEL);
	}
	return base;
} 

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2005-07-20 20:46 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-06-24 18:20 page allocation/attributes question (i386/x86_64 specific) Stuart_Hayes
2005-06-25  2:28 ` Andi Kleen
  -- strict thread matches above, loose matches on Subject: below --
2005-07-20 19:24 Stuart_Hayes
2005-07-20 15:06 Stuart_Hayes
2005-07-20 20:45 ` Ingo Molnar
2005-07-20 14:51 Stuart_Hayes
2005-07-20 15:00 ` Ingo Molnar
2005-07-19 21:20 Stuart Hayes
2005-07-19 22:04 ` Ingo Molnar
2005-07-07 17:21 Stuart_Hayes
2005-07-05 21:35 Stuart_Hayes
2005-07-05 22:00 ` randy_dunlap
2005-07-05 20:02 Stuart_Hayes
2005-07-05 21:27 ` randy_dunlap
2005-07-07 14:33 ` Andi Kleen
2005-07-01 20:33 Stuart_Hayes
2005-06-30 19:11 Stuart_Hayes
2005-06-30 17:14 Stuart_Hayes
2005-06-30 16:56 Stuart_Hayes
2005-06-30 17:34 ` Arjan van de Ven
2005-06-29 17:20 Stuart_Hayes
2005-06-28 20:16 Stuart_Hayes
2005-06-30 16:15 ` Andi Kleen
2005-06-22 17:11 Stuart_Hayes
2005-06-22 23:54 ` Rik Van Riel
