linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Memory leak in 2.4.27 kernel, using mmap raw packet sockets
@ 2004-10-14 14:50 bgagnon
  2004-10-15 18:23 ` Marcelo Tosatti
  0 siblings, 1 reply; 19+ messages in thread
From: bgagnon @ 2004-10-14 14:50 UTC (permalink / raw)
  To: linux-kernel

(Please cc me directly, since I am not a subscriber to this list)

Hi,

I discovered a memory leak in the mmap raw packet socket implementation, 
specifically when the client application core dumps (elf format). More 
specifically, the entire ring buffer memory leaks under such 
circumstances. I have reproduced the problem with our custom app, which 
links with Phil Wood's mmap version of libpcap (
http://plublic.lanl.gov/cpw) as well as with tcpdump binding against the 
same library. I have asserted the problem presence in kernels 2.4.26 and 
2.4.27. It has been resolved in kernel 2.6.8.

In a nutshell, the reference count of the ring buffer page frame is not 
handled properly by the core dump function elf_core_dump() in 
fs/binfmt_elf.c. After exiting this function, all ring buffer page frames 
reference counts have been increased by one, which impedes their release 
when the socket is closed. 

The reference count mishap is related to the fact that the ring buffer 
page frames have their PG_reserved bit set when they are created by the 
raw packet socket. According to the comments in mm.h, this is to ensure 
the pages don't get swapped out to disk. elf_core_dump() calls 
get_user_pages(), which increments the reference count, irrespective of 
the PG_reserved bit, while the subsequent put_page() call only decrements 
it if the PG_reserved bit is clear.

According to a similar kernel mailing list thread (
http://www.ussg.iu.edu/hypermail/linux/kernel/0311.3/0495.html), the page 
reference count field is not significant if the PG_reserved bit is set. 
Applied to our particular case, the raw packet socket implementation 
should unconditionally reset the reference count when the socket is 
closed, so that the memory gets released. However, I fear this would cause 
user space applications to crash: according to the mmap man page, the 
memory mapping should survive the socket closure:

"The  munmap  system call deletes the mappings for the specified address
range, and causes further references to addresses within the  range  to
generate  invalid  memory references.  The region is also automatically
unmapped when the process is terminated.  On the  other  hand,  closing
the file descriptor does not unmap the region."

So, on 2.4 kernels, we are facing the following situation:

1- The reference count information IS important to control the ring buffer 
lifetime
2- The PG_reserved bit causes the reference count not to be handled 
properly by different kernel subsystems, the core dump functions being one 
of them

>From what I could understand, 2.6 developer fixed the get_user_pages() 
code so it does not increment to reference count if the page has the 
PG_reserved bit set. I don't know if this is readily applicable to 2.4 
kernels, since it may break some other subsystem that rely on its current 
behavior.

So, unless a better solution is found, I would propose a localized fix for 
the leak, in function elf_core_dump(), starting at line 1275  (comments 
are my own) :

if (get_user_pages(current, current->mm, addr, 1, 0, 1,   /*BG: this 
ALWAYS will increment the ref count*/
                   &page, &vma) <= 0) {
       DUMP_SEEK (file->f_pos + PAGE_SIZE);
} else {
       if (page == ZERO_PAGE(addr)) {
              DUMP_SEEK (file->f_pos + PAGE_SIZE);
    } else {
            void *kaddr;
            flush_cache_page(vma, addr);
            kaddr = kmap(page);
            DUMP_WRITE(kaddr, PAGE_SIZE);
            flush_page_to_ram(page);
            kunmap(page);
    }
    /*BG: Start of fix*/ 
    if (PageReserved(page))
            /* BG: just undo the ref count increase done in 
get_user_pages*/
            put_page_testzero(page);
    else
            put_page(page);
    /*BG: end of fix */
}



Thanks,

Bernard

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-10-14 14:50 Memory leak in 2.4.27 kernel, using mmap raw packet sockets bgagnon
@ 2004-10-15 18:23 ` Marcelo Tosatti
  2004-10-17  2:39   ` Alan Cox
  0 siblings, 1 reply; 19+ messages in thread
From: Marcelo Tosatti @ 2004-10-15 18:23 UTC (permalink / raw)
  To: bgagnon; +Cc: linux-kernel

On Thu, Oct 14, 2004 at 10:50:14AM -0400, bgagnon@coradiant.com wrote:
> (Please cc me directly, since I am not a subscriber to this list)
> 
> Hi,
> 
> I discovered a memory leak in the mmap raw packet socket implementation, 
> specifically when the client application core dumps (elf format). More 
> specifically, the entire ring buffer memory leaks under such 
> circumstances. I have reproduced the problem with our custom app, which 
> links with Phil Wood's mmap version of libpcap (
> http://plublic.lanl.gov/cpw) as well as with tcpdump binding against the 
> same library. I have asserted the problem presence in kernels 2.4.26 and 
> 2.4.27. It has been resolved in kernel 2.6.8.
> 
> In a nutshell, the reference count of the ring buffer page frame is not 
> handled properly by the core dump function elf_core_dump() in 
> fs/binfmt_elf.c. After exiting this function, all ring buffer page frames 
> reference counts have been increased by one, which impedes their release 
> when the socket is closed. 
> 
> The reference count mishap is related to the fact that the ring buffer 
> page frames have their PG_reserved bit set when they are created by the 
> raw packet socket. According to the comments in mm.h, this is to ensure 
> the pages don't get swapped out to disk. elf_core_dump() calls 
> get_user_pages(), which increments the reference count, irrespective of 
> the PG_reserved bit, while the subsequent put_page() call only decrements 
> it if the PG_reserved bit is clear.
> 
> According to a similar kernel mailing list thread (
> http://www.ussg.iu.edu/hypermail/linux/kernel/0311.3/0495.html), the page 
> reference count field is not significant if the PG_reserved bit is set. 
> Applied to our particular case, the raw packet socket implementation 
> should unconditionally reset the reference count when the socket is 
> closed, so that the memory gets released. However, I fear this would cause 
> user space applications to crash: according to the mmap man page, the 
> memory mapping should survive the socket closure:
> 
> "The  munmap  system call deletes the mappings for the specified address
> range, and causes further references to addresses within the  range  to
> generate  invalid  memory references.  The region is also automatically
> unmapped when the process is terminated.  On the  other  hand,  closing
> the file descriptor does not unmap the region."
> 
> So, on 2.4 kernels, we are facing the following situation:
> 
> 1- The reference count information IS important to control the ring buffer 
> lifetime
> 2- The PG_reserved bit causes the reference count not to be handled 
> properly by different kernel subsystems, the core dump functions being one 
> of them
> 
> >From what I could understand, 2.6 developer fixed the get_user_pages() 
> code so it does not increment to reference count if the page has the 
> PG_reserved bit set. I don't know if this is readily applicable to 2.4 
> kernels, since it may break some other subsystem that rely on its current 
> behavior.
> 
> So, unless a better solution is found, I would propose a localized fix for 
> the leak, in function elf_core_dump(), starting at line 1275  (comments 
> are my own) :
> 
> if (get_user_pages(current, current->mm, addr, 1, 0, 1,   /*BG: this 
> ALWAYS will increment the ref count*/
>                    &page, &vma) <= 0) {
>        DUMP_SEEK (file->f_pos + PAGE_SIZE);
> } else {
>        if (page == ZERO_PAGE(addr)) {
>               DUMP_SEEK (file->f_pos + PAGE_SIZE);
>     } else {
>             void *kaddr;
>             flush_cache_page(vma, addr);
>             kaddr = kmap(page);
>             DUMP_WRITE(kaddr, PAGE_SIZE);
>             flush_page_to_ram(page);
>             kunmap(page);
>     }
>     /*BG: Start of fix*/ 
>     if (PageReserved(page))
>             /* BG: just undo the ref count increase done in 
> get_user_pages*/
>             put_page_testzero(page);
>     else
>             put_page(page);
>     /*BG: end of fix */
> }

Hi Bernand, 

I prefer doing the "if (PageReserved(page)) put_page_testzero(page)" as
you propose instead of changing get_user_pages(), as there are several
users which rely on its behaviour.

I have applied your fix to the 2.4 BK tree.

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-10-15 18:23 ` Marcelo Tosatti
@ 2004-10-17  2:39   ` Alan Cox
  2004-10-19 14:35     ` Marcelo Tosatti
  2004-11-25 15:02     ` Marcelo Tosatti
  0 siblings, 2 replies; 19+ messages in thread
From: Alan Cox @ 2004-10-17  2:39 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: bgagnon, Linux Kernel Mailing List

On Gwe, 2004-10-15 at 19:23, Marcelo Tosatti wrote:
> I prefer doing the "if (PageReserved(page)) put_page_testzero(page)" as
> you propose instead of changing get_user_pages(), as there are several
> users which rely on its behaviour.
> 
> I have applied your fix to the 2.4 BK tree.

That isnt sufficient. Consider anything else taking a reference to the
page and the refcount going negative. And yes 2.6.x has this problem and
far worse in some ways, but it also has the mechanism to fix it.

2.6.x uses VM_IO as a VMA flag which tells the kernel two things
a) get_user_pages fails on it
b) core dumping of it is forbidden

2.6.x is missing a whole pile of these (fixed in the 2.6.9-ac tree I'm
putting together). I *think* remap_page_range() in 2.6.x can just set
VM_IO, but older kernels didn't pass the vma so all the users would need
fixing (OSS audio, media/video, usb audio, usb video, frame buffer
etc).

Alan


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-10-17  2:39   ` Alan Cox
@ 2004-10-19 14:35     ` Marcelo Tosatti
  2004-10-20 18:43       ` Alan Cox
  2004-11-25 15:02     ` Marcelo Tosatti
  1 sibling, 1 reply; 19+ messages in thread
From: Marcelo Tosatti @ 2004-10-19 14:35 UTC (permalink / raw)
  To: Alan Cox; +Cc: bgagnon, Linux Kernel Mailing List

On Sun, Oct 17, 2004 at 03:39:26AM +0100, Alan Cox wrote:
> On Gwe, 2004-10-15 at 19:23, Marcelo Tosatti wrote:
> > I prefer doing the "if (PageReserved(page)) put_page_testzero(page)" as
> > you propose instead of changing get_user_pages(), as there are several
> > users which rely on its behaviour.
> > 
> > I have applied your fix to the 2.4 BK tree.
> 
> That isnt sufficient. Consider anything else taking a reference to the
> page and the refcount going negative. 

You mean not going negative? The problem here as I understand here is 
we dont release the count if the PageReserved is set, but we should. 

You mean there are other codepaths which release pages? That use __free_pages
which ignores PageReserved pages.

Is the problem wider than what I think?

> And yes 2.6.x has this problem and
> far worse in some ways, but it also has the mechanism to fix it.
> 
> 2.6.x uses VM_IO as a VMA flag which tells the kernel two things
> a) get_user_pages fails on it
> b) core dumping of it is forbidden
> 
> 2.6.x is missing a whole pile of these (fixed in the 2.6.9-ac tree I'm
> putting together). I *think* remap_page_range() in 2.6.x can just set
> VM_IO, but older kernels didn't pass the vma so all the users would need
> fixing (OSS audio, media/video, usb audio, usb video, frame buffer
> etc).

All these are have codepaths which release pages using put_page()'s? 

Thanks

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-10-19 14:35     ` Marcelo Tosatti
@ 2004-10-20 18:43       ` Alan Cox
  2004-10-20 23:24         ` Andrea Arcangeli
  0 siblings, 1 reply; 19+ messages in thread
From: Alan Cox @ 2004-10-20 18:43 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: bgagnon, Linux Kernel Mailing List

On Maw, 2004-10-19 at 15:35, Marcelo Tosatti wrote:
> > That isnt sufficient. Consider anything else taking a reference to the
> > page and the refcount going negative. 
> 
> You mean not going negative? The problem here as I understand here is 
> we dont release the count if the PageReserved is set, but we should.

Drivers like the OSS audio drivers move page between Reserved and 
unreserved. The count can thus be corrupted.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-10-20 18:43       ` Alan Cox
@ 2004-10-20 23:24         ` Andrea Arcangeli
  2004-10-23 14:17           ` Marcelo Tosatti
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2004-10-20 23:24 UTC (permalink / raw)
  To: Alan Cox; +Cc: Marcelo Tosatti, bgagnon, Linux Kernel Mailing List

On Wed, Oct 20, 2004 at 07:43:58PM +0100, Alan Cox wrote:
> On Maw, 2004-10-19 at 15:35, Marcelo Tosatti wrote:
> > > That isnt sufficient. Consider anything else taking a reference to the
> > > page and the refcount going negative. 
> > 
> > You mean not going negative? The problem here as I understand here is 
> > we dont release the count if the PageReserved is set, but we should.
> 
> Drivers like the OSS audio drivers move page between Reserved and 
> unreserved. The count can thus be corrupted.

the PG_reserved goes away after VM_IO, so forbidding pages with
PG_reserved of vmas with VM_IO isn't any different as far as I can tell,
and since PG_reserved is the real offender sure we shouldn't leave a
check in get_user_pages that explicitly do something if the page is
reserved, since if the page is reserved at that point we'd need to
return -EFAULT or BUG_ON.

Adding the VM_IO patch on top of this is sure a good idea.

--- sles/mm/memory.c.~1~	2004-10-19 19:34:10.264335488 +0200
+++ sles/mm/memory.c	2004-10-19 19:58:47.403776160 +0200
@@ -806,7 +806,7 @@ int get_user_pages(struct task_struct *t
 			}
 			if (pages) {
 				pages[i] = get_page_map(map);
-				if (!pages[i]) {
+				if (!pages[i] || PageReserved(pages[i])) {
 					spin_unlock(&mm->page_table_lock);
 					while (i--)
 						page_cache_release(pages[i]);
@@ -814,8 +814,7 @@ int get_user_pages(struct task_struct *t
 					goto out;
 				}
 				flush_dcache_page(pages[i]);
-				if (!PageReserved(pages[i]))
-					page_cache_get(pages[i]);
+				page_cache_get(pages[i]);
 			}
 			if (vmas)
 				vmas[i] = vma;

My version of the fix for 2.4 is this, but this fixes as well an issue
with the zeropage and it's on top of some other fix for COW corruption
in 2.4 not yet fixed in mainline 2.4. Since 2.4 never checked
PageReserved like 2.6 does in get_user_pages, 2.4 as worse can suffer a
memleak.

--- sles/include/linux/mm.h.~1~	2004-10-18 10:20:53.391823696 +0200
+++ sles/include/linux/mm.h	2004-10-18 10:47:10.861011928 +0200
@@ -533,9 +533,8 @@ extern void unpin_pte_page(struct page *
 
 static inline void put_user_page_pte_pin(struct page * page)
 {
-	if (PagePinned(page))
-		/* must run before put_page, put_page may free the page */
-		unpin_pte_page(page);
+	/* must run before put_page, put_page may free the page */
+	unpin_pte_page(page);
 
 	put_page(page);
 }
--- sles/mm/memory.c.~1~	2004-10-18 10:20:54.947587184 +0200
+++ sles/mm/memory.c	2004-10-18 10:47:49.822088944 +0200
@@ -530,7 +530,11 @@ void __wait_on_pte_pinned_page(struct pa
 
 void unpin_pte_page(struct page *page)
 {
-	wait_queue_head_t *waitqueue = page_waitqueue(page);
+	wait_queue_head_t *waitqueue;
+
+	if (!PagePinned(page))
+		return;
+	waitqueue = page_waitqueue(page);
 	if (unlikely(!TestClearPagePinned(page)))
 		BUG();
 	smp_mb__after_clear_bit(); 
@@ -598,17 +602,21 @@ int __get_user_pages(struct task_struct 
 				 */
 				if (!map)
 					goto bad_page;
-				page_cache_get(map);
-				if (pte_pin && unlikely(TestSetPagePinned(map))) {
-					/* fail if this is a duplicate physical page in this kiovec */
-					int i2 = i;
-					while (i2--)
-						if (map == pages[i2]) {
-							put_page(map);
-							goto bad_page;
-						}
-					/* hold a reference on "map" so we can wait on it */
-					goto pte_pin_collision;
+				if (map != ZERO_PAGE(start)) {
+					if (PageReserved(map))
+						goto bad_page;
+					page_cache_get(map);
+					if (pte_pin && unlikely(TestSetPagePinned(map))) {
+						/* fail if this is a duplicate physical page in this kiovec */
+						int i2 = i;
+						while (i2--)
+							if (map == pages[i2]) {
+								put_page(map);
+								goto bad_page;
+							}
+						/* hold a reference on "map" so we can wait on it */
+						goto pte_pin_collision;
+					}
 				}
 				pages[i] = map;
 			}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-10-20 23:24         ` Andrea Arcangeli
@ 2004-10-23 14:17           ` Marcelo Tosatti
  0 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2004-10-23 14:17 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Alan Cox, bgagnon, Linux Kernel Mailing List


Hi Alan, Andrea,

On Thu, Oct 21, 2004 at 01:24:24AM +0200, Andrea Arcangeli wrote:
> On Wed, Oct 20, 2004 at 07:43:58PM +0100, Alan Cox wrote:
> > On Maw, 2004-10-19 at 15:35, Marcelo Tosatti wrote:
> > > > That isnt sufficient. Consider anything else taking a reference to the
> > > > page and the refcount going negative. 
> > > 
> > > You mean not going negative? The problem here as I understand here is 
> > > we dont release the count if the PageReserved is set, but we should.
> > 
> > Drivers like the OSS audio drivers move page between Reserved and 
> > unreserved. The count can thus be corrupted.
> 
> the PG_reserved goes away after VM_IO, so forbidding pages with
> PG_reserved of vmas with VM_IO isn't any different as far as I can tell,
> and since PG_reserved is the real offender sure we shouldn't leave a
> check in get_user_pages that explicitly do something if the page is
> reserved, since if the page is reserved at that point we'd need to
> return -EFAULT or BUG_ON.
> 
> Adding the VM_IO patch on top of this is sure a good idea.
> 
> --- sles/mm/memory.c.~1~	2004-10-19 19:34:10.264335488 +0200
> +++ sles/mm/memory.c	2004-10-19 19:58:47.403776160 +0200
> @@ -806,7 +806,7 @@ int get_user_pages(struct task_struct *t
>  			}
>  			if (pages) {
>  				pages[i] = get_page_map(map);
> -				if (!pages[i]) {
> +				if (!pages[i] || PageReserved(pages[i])) {
>  					spin_unlock(&mm->page_table_lock);
>  					while (i--)
>  						page_cache_release(pages[i]);
> @@ -814,8 +814,7 @@ int get_user_pages(struct task_struct *t
>  					goto out;
>  				}
>  				flush_dcache_page(pages[i]);
> -				if (!PageReserved(pages[i]))
> -					page_cache_get(pages[i]);
> +				page_cache_get(pages[i]);
>  			}
>  			if (vmas)
>  				vmas[i] = vma;
> 
> My version of the fix for 2.4 is this, but this fixes as well an issue
> with the zeropage and it's on top of some other fix for COW corruption
> in 2.4 not yet fixed in mainline 2.4. Since 2.4 never checked
> PageReserved like 2.6 does in get_user_pages, 2.4 as worse can suffer a
> memleak.

I wonder what are the consequences and dangers of not referencing 
PG_reserved pages at get_user_pages(). I agree the current behaviour 
of referencing reserved pages and not unreferencing PageReserved at
free_pages is broken.

However, access_process_vm(), used by ptrace, map_user_kiobuf() and the 
elf core dumping code rely on the additional reference count 
increased by get_user_pages() to guarantee the existance of 
the page(s). What about that?

I need to understand PG_reserved usage by drivers in more details.

Sure the above patch would fix the problem reported
(including the elf core dump memory leak).

Now Alan's case of pages being changed from/to PG_reserved 
is pretty damn nasty isnt it (uncovered by this patch from
what I understand).


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-10-17  2:39   ` Alan Cox
  2004-10-19 14:35     ` Marcelo Tosatti
@ 2004-11-25 15:02     ` Marcelo Tosatti
  2004-11-25 20:32       ` Andrea Arcangeli
  1 sibling, 1 reply; 19+ messages in thread
From: Marcelo Tosatti @ 2004-11-25 15:02 UTC (permalink / raw)
  To: Alan Cox; +Cc: bgagnon, Linux Kernel Mailing List, Andrea Arcangeli, davem

On Sun, Oct 17, 2004 at 03:39:26AM +0100, Alan Cox wrote:
> On Gwe, 2004-10-15 at 19:23, Marcelo Tosatti wrote:
> > I prefer doing the "if (PageReserved(page)) put_page_testzero(page)" as
> > you propose instead of changing get_user_pages(), as there are several
> > users which rely on its behaviour.
> > 
> > I have applied your fix to the 2.4 BK tree.
> 
> That isnt sufficient. Consider anything else taking a reference to the
> page and the refcount going negative. And yes 2.6.x has this problem and
> far worse in some ways, but it also has the mechanism to fix it.
> 
> 2.6.x uses VM_IO as a VMA flag which tells the kernel two things
> a) get_user_pages fails on it
> b) core dumping of it is forbidden
> 
> 2.6.x is missing a whole pile of these (fixed in the 2.6.9-ac tree I'm
> putting together). I *think* remap_page_range() in 2.6.x can just set
> VM_IO, but older kernels didn't pass the vma so all the users would need
> fixing (OSS audio, media/video, usb audio, usb video, frame buffer
> etc).

I dont see any practical problem with 2.4.x right now.

get_user_pages() wont be called on driver created VMA's with PageReserved 
pages because of the VM_IO bit which is set at remap_page_range(). 

Its not possible to have any vma mapped by a driver without VM_IO set.

But the network packet mmap was an isolated case, so I'll apply Andrea's 
fix just for safety, although I can't find any offender in the tree.


--- memory.c    2004-10-22 15:58:28.000000000 -0200
+++ memory.c  2004-10-28 14:32:26.585813200 -0200
@@ -499,7 +499,7 @@
                                /* FIXME: call the correct function,
                                 * depending on the type of the found page
                                 */
-                               if (!pages[i])
+                               if (!pages[i] || PageReserved(pages[i]))
                                        goto bad_page;
                                page_cache_get(pages[i]);
                        }



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-11-25 20:32       ` Andrea Arcangeli
@ 2004-11-25 17:12         ` Marcelo Tosatti
  2004-11-25 23:13           ` Andrea Arcangeli
  0 siblings, 1 reply; 19+ messages in thread
From: Marcelo Tosatti @ 2004-11-25 17:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alan Cox, bgagnon, Linux Kernel Mailing List, Andrea Arcangeli, davem

Hi Andrea,

On Thu, Nov 25, 2004 at 09:32:48PM +0100, Andrea Arcangeli wrote:
> On Thu, Nov 25, 2004 at 01:02:06PM -0200, Marcelo Tosatti wrote:
> > On Sun, Oct 17, 2004 at 03:39:26AM +0100, Alan Cox wrote:
> > > On Gwe, 2004-10-15 at 19:23, Marcelo Tosatti wrote:
> > > > I prefer doing the "if (PageReserved(page)) put_page_testzero(page)" as
> > > > you propose instead of changing get_user_pages(), as there are several
> > > > users which rely on its behaviour.
> > > > 
> > > > I have applied your fix to the 2.4 BK tree.
> > > 
> > > That isnt sufficient. Consider anything else taking a reference to the
> > > page and the refcount going negative. And yes 2.6.x has this problem and
> > > far worse in some ways, but it also has the mechanism to fix it.
> > > 
> > > 2.6.x uses VM_IO as a VMA flag which tells the kernel two things
> > > a) get_user_pages fails on it
> > > b) core dumping of it is forbidden
> > > 
> > > 2.6.x is missing a whole pile of these (fixed in the 2.6.9-ac tree I'm
> > > putting together). I *think* remap_page_range() in 2.6.x can just set
> > > VM_IO, but older kernels didn't pass the vma so all the users would need
> > > fixing (OSS audio, media/video, usb audio, usb video, frame buffer
> > > etc).
> > 
> > I dont see any practical problem with 2.4.x right now.
> > 
> > get_user_pages() wont be called on driver created VMA's with PageReserved 
> > pages because of the VM_IO bit which is set at remap_page_range(). 
> > 
> > Its not possible to have any vma mapped by a driver without VM_IO set.
> > 
> > But the network packet mmap was an isolated case, so I'll apply Andrea's 
> > fix just for safety, although I can't find any offender in the tree.
> > 
> > 
> > --- memory.c    2004-10-22 15:58:28.000000000 -0200
> > +++ memory.c  2004-10-28 14:32:26.585813200 -0200
> > @@ -499,7 +499,7 @@
> >                                 /* FIXME: call the correct function,
> >                                  * depending on the type of the found page
> >                                  */
> > -                               if (!pages[i])
> > +                               if (!pages[i] || PageReserved(pages[i]))
> >                                         goto bad_page;
> >                                 page_cache_get(pages[i]);
> >                         }
> 
> this needs to be modified to take the ZERO_PAGE(start) into account.
> It's a minor detail, but we should allow that. (my 2.4-aa tree was
> already checking the zeropage for other reasons, so it didn't need any
> change)
> 
> in short the above fix should be modified like this:
> 
> 	if (!pages[i] || PageReserved(pages[i])) {
> 		if (pages[i] != ZERO_PAGE(start))
> 			goto bad_page;
> 	} else
> 		page_cache_get(pages[i]);
> 
> the zero page is guaranteed to remain reserved. the major bug is that
> get_user_pages was allowing _temporarily_ reserved pages to be pinned.
> __free_pages isn't checking the VM_IO, and as such get_user_pages should
> be robust against its __free_pages counterpart, this is why I believe
> the above fix is the right fix.
> 
> ZEROPAGE is special because:
> 
> 1) it's guaranteed to never be unpinned
> 2) it's the only reserved page that handle_mm_fault is allowed to
> istantiate for a filesystem data mapping
>
> The VM_IO enforcment is a nice improvement on top of the above.

get_user_pages() bails out if ! VM_IO. 

Is that what you mean with "VM_IO enforcement" ? 

> Checking the PageReserved bitflag is a good thing at least for the
> zeropage, so we don't overflow the zeropage count, which isn't nice.
> 
> If you really want to fix it only using VM_IO, I still recommend to
> apply the above patch, and to turn 'goto bad_page' into BUG().
> It'd be a bad idea not to at least add the above code as a robustness
> check.
>
> Comments welcome. thanks.

I thought about the BUG() to catch potential offenders, but I was
not sure if it was possible for a PG_reserved page to be part of VMA's 
which was being get_user_pages'd.

Now you tell me it is possible, and thats only the ZERO page. Fine. 

This is what you suggests plus some extra hopefully useful debugging 


--- memory.c.orig	2004-11-25 14:51:00.074508952 -0200
+++ memory.c	2004-11-25 15:08:38.026675776 -0200
@@ -454,8 +454,9 @@
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
 		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas)
 {
-	int i;
+	int i, s;
 	unsigned int flags;
+	struct vm_area_struct *savevma = NULL;
 
 	/*
 	 * Require read or write permissions.
@@ -463,7 +464,7 @@
 	 */
 	flags = write ? (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
 	flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
-	i = 0;
+	i = s = 0;
 
 	do {
 		struct vm_area_struct *	vma;
@@ -499,9 +500,13 @@
 				/* FIXME: call the correct function,
 				 * depending on the type of the found page
 				 */
-				if (!pages[i] || PageReserved(pages[i]))
-					goto bad_page;
-				page_cache_get(pages[i]);
+				if (!pages[i] || PageReserved(pages[i])) {
+					if (pages[i] != ZERO_PAGE(start)) {
+						savevma = vma;
+						goto bad_page;
+					}
+				} else
+					page_cache_get(pages[i]);
 			}
 			if (vmas)
 				vmas[i] = vma;
@@ -520,9 +525,15 @@
 	 */
 bad_page:
 	spin_unlock(&mm->page_table_lock);
+	s = i;
 	while (i--)
 		page_cache_release(pages[i]);
-	i = -EFAULT;
+	/* catch bad uses of PG_reserved on !VM_IO vma's */
+	printk(KERN_ERR "get_user_pages PG_reserved page on"
+			"vma:%p flags:%lx page:%d\n", savevma,
+			savevma->flags, s);
+	BUG();
+	i = -EFAULT;
 	goto out;
 }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-11-25 23:13           ` Andrea Arcangeli
@ 2004-11-25 19:45             ` Marcelo Tosatti
  2004-11-26  1:04               ` Andrea Arcangeli
  0 siblings, 1 reply; 19+ messages in thread
From: Marcelo Tosatti @ 2004-11-25 19:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alan Cox, bgagnon, Linux Kernel Mailing List, Andrea Arcangeli, davem

On Fri, Nov 26, 2004 at 12:13:14AM +0100, Andrea Arcangeli wrote:
> On Thu, Nov 25, 2004 at 03:12:42PM -0200, Marcelo Tosatti wrote:
> > get_user_pages() bails out if ! VM_IO. 
> > 
> > Is that what you mean with "VM_IO enforcement" ? 
> 
> yes. It bails out if VM_IO is set (not ! clear ;)
> 
> > I thought about the BUG() to catch potential offenders, but I was
> > not sure if it was possible for a PG_reserved page to be part of VMA's 
> > which was being get_user_pages'd.
> 
> Exactly, it's much safer to go with the real fix of fixing it in
> get_user_pages. If something we should put a bugcheck there.
> 
> > Now you tell me it is possible, and thats only the ZERO page. Fine. 
> 
> Yes, and the ZERO_PAGE is actually the _only_ reserved page we must
> allow to go through. Every other reserved page must be discarded (or
> kernel-crash with BUG_ON if Alan feels confortable with the VM_IO
> enforcement). 

Oh the VM_IO enforcement has been there for ages.

> > This is what you suggests plus some extra hopefully useful debugging 
> > 
> > 
> > --- memory.c.orig	2004-11-25 14:51:00.074508952 -0200
> > +++ memory.c	2004-11-25 15:08:38.026675776 -0200
> > @@ -454,8 +454,9 @@
> >  int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
> >  		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas)
> >  {
> > -	int i;
> > +	int i, s;
> >  	unsigned int flags;
> > +	struct vm_area_struct *savevma = NULL;
> >  
> >  	/*
> >  	 * Require read or write permissions.
> > @@ -463,7 +464,7 @@
> >  	 */
> >  	flags = write ? (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
> >  	flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
> > -	i = 0;
> > +	i = s = 0;
> >  
> >  	do {
> >  		struct vm_area_struct *	vma;
> > @@ -499,9 +500,13 @@
> >  				/* FIXME: call the correct function,
> >  				 * depending on the type of the found page
> >  				 */
> > -				if (!pages[i] || PageReserved(pages[i]))
> > -					goto bad_page;
> > -				page_cache_get(pages[i]);
> > +				if (!pages[i] || PageReserved(pages[i])) {
> > +					if (pages[i] != ZERO_PAGE(start)) {
> > +						savevma = vma;
> > +						goto bad_page;
> > +					}
> > +				} else
> > +					page_cache_get(pages[i]);
> >  			}
> >  			if (vmas)
> >  				vmas[i] = vma;
> > @@ -520,9 +525,15 @@
> >  	 */
> >  bad_page:
> >  	spin_unlock(&mm->page_table_lock);
> > +	s = i;
> >  	while (i--)
> >  		page_cache_release(pages[i]);
> > -	i = -EFAULT;
> > +	/* catch bad uses of PG_reserved on !VM_IO vma's */
> > +	printk(KERN_ERR "get_user_pages PG_reserved page on"
> > +			"vma:%p flags:%lx page:%d\n", savevma,
> > +			savevma->flags, s);
> > +	BUG();
> > +	i = -EFAULT;
> >  	goto out;
> >  }
> 
> Yes, however I wouldn't turn on the debugging code just in case some
> driver forgets to set VM_IO and it doesn't use remap_page_range. There's
> nothing fundamentally fatal in having a reserved page in a non VM_IO
> vma (I mean, after fixing the above bit ;).

Sure, I'll comment the BUG() off during 2.4.29-rc.

How does that sound?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-11-25 15:02     ` Marcelo Tosatti
@ 2004-11-25 20:32       ` Andrea Arcangeli
  2004-11-25 17:12         ` Marcelo Tosatti
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2004-11-25 20:32 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Alan Cox, bgagnon, Linux Kernel Mailing List, Andrea Arcangeli, davem

On Thu, Nov 25, 2004 at 01:02:06PM -0200, Marcelo Tosatti wrote:
> On Sun, Oct 17, 2004 at 03:39:26AM +0100, Alan Cox wrote:
> > On Gwe, 2004-10-15 at 19:23, Marcelo Tosatti wrote:
> > > I prefer doing the "if (PageReserved(page)) put_page_testzero(page)" as
> > > you propose instead of changing get_user_pages(), as there are several
> > > users which rely on its behaviour.
> > > 
> > > I have applied your fix to the 2.4 BK tree.
> > 
> > That isnt sufficient. Consider anything else taking a reference to the
> > page and the refcount going negative. And yes 2.6.x has this problem and
> > far worse in some ways, but it also has the mechanism to fix it.
> > 
> > 2.6.x uses VM_IO as a VMA flag which tells the kernel two things
> > a) get_user_pages fails on it
> > b) core dumping of it is forbidden
> > 
> > 2.6.x is missing a whole pile of these (fixed in the 2.6.9-ac tree I'm
> > putting together). I *think* remap_page_range() in 2.6.x can just set
> > VM_IO, but older kernels didn't pass the vma so all the users would need
> > fixing (OSS audio, media/video, usb audio, usb video, frame buffer
> > etc).
> 
> I dont see any practical problem with 2.4.x right now.
> 
> get_user_pages() wont be called on driver created VMA's with PageReserved 
> pages because of the VM_IO bit which is set at remap_page_range(). 
> 
> Its not possible to have any vma mapped by a driver without VM_IO set.
> 
> But the network packet mmap was an isolated case, so I'll apply Andrea's 
> fix just for safety, although I can't find any offender in the tree.
> 
> 
> --- memory.c    2004-10-22 15:58:28.000000000 -0200
> +++ memory.c  2004-10-28 14:32:26.585813200 -0200
> @@ -499,7 +499,7 @@
>                                 /* FIXME: call the correct function,
>                                  * depending on the type of the found page
>                                  */
> -                               if (!pages[i])
> +                               if (!pages[i] || PageReserved(pages[i]))
>                                         goto bad_page;
>                                 page_cache_get(pages[i]);
>                         }

this needs to be modified to take the ZERO_PAGE(start) into account.
It's a minor detail, but we should allow that. (my 2.4-aa tree was
already checking the zeropage for other reasons, so it didn't need any
change)

in short the above fix should be modified like this:

	if (!pages[i] || PageReserved(pages[i])) {
		if (pages[i] != ZERO_PAGE(start))
			goto bad_page;
	} else
		page_cache_get(pages[i]);

the zero page is guaranteed to remain reserved. the major bug is that
get_user_pages was allowing _temporarily_ reserved pages to be pinned.
__free_pages isn't checking the VM_IO, and as such get_user_pages should
be robust against its __free_pages counterpart, this is why I believe
the above fix is the right fix.

ZEROPAGE is special because:

1) it's guaranteed to never be unpinned
2) it's the only reserved page that handle_mm_fault is allowed to
istantiate for a filesystem data mapping

The VM_IO enforcment is a nice improvement on top of the above.

Checking the PageReserved bitflag is a good thing at least for the
zeropage, so we don't overflow the zeropage count, which isn't nice.

If you really want to fix it only using VM_IO, I still recommend to
apply the above patch, and to turn 'goto bad_page' into BUG().
It'd be a bad idea not to at least add the above code as a robustness
check.

Comments welcome. thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-11-25 17:12         ` Marcelo Tosatti
@ 2004-11-25 23:13           ` Andrea Arcangeli
  2004-11-25 19:45             ` Marcelo Tosatti
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2004-11-25 23:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Alan Cox, bgagnon, Linux Kernel Mailing List, Andrea Arcangeli, davem

On Thu, Nov 25, 2004 at 03:12:42PM -0200, Marcelo Tosatti wrote:
> get_user_pages() bails out if ! VM_IO. 
> 
> Is that what you mean with "VM_IO enforcement" ? 

yes. It bails out if VM_IO is set (not ! clear ;)

> I thought about the BUG() to catch potential offenders, but I was
> not sure if it was possible for a PG_reserved page to be part of VMA's 
> which was being get_user_pages'd.

Exactly, it's much safer to go with the real fix of fixing it in
get_user_pages. If something we should put a bugcheck there.

> Now you tell me it is possible, and thats only the ZERO page. Fine. 

Yes, and the ZERO_PAGE is actually the _only_ reserved page we must
allow to go through. Every other reserved page must be discarded (or
kernel-crash with BUG_ON if Alan feels confortable with the VM_IO
enforcement).

> This is what you suggests plus some extra hopefully useful debugging 
> 
> 
> --- memory.c.orig	2004-11-25 14:51:00.074508952 -0200
> +++ memory.c	2004-11-25 15:08:38.026675776 -0200
> @@ -454,8 +454,9 @@
>  int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
>  		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas)
>  {
> -	int i;
> +	int i, s;
>  	unsigned int flags;
> +	struct vm_area_struct *savevma = NULL;
>  
>  	/*
>  	 * Require read or write permissions.
> @@ -463,7 +464,7 @@
>  	 */
>  	flags = write ? (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
>  	flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
> -	i = 0;
> +	i = s = 0;
>  
>  	do {
>  		struct vm_area_struct *	vma;
> @@ -499,9 +500,13 @@
>  				/* FIXME: call the correct function,
>  				 * depending on the type of the found page
>  				 */
> -				if (!pages[i] || PageReserved(pages[i]))
> -					goto bad_page;
> -				page_cache_get(pages[i]);
> +				if (!pages[i] || PageReserved(pages[i])) {
> +					if (pages[i] != ZERO_PAGE(start)) {
> +						savevma = vma;
> +						goto bad_page;
> +					}
> +				} else
> +					page_cache_get(pages[i]);
>  			}
>  			if (vmas)
>  				vmas[i] = vma;
> @@ -520,9 +525,15 @@
>  	 */
>  bad_page:
>  	spin_unlock(&mm->page_table_lock);
> +	s = i;
>  	while (i--)
>  		page_cache_release(pages[i]);
> -	i = -EFAULT;
> +	/* catch bad uses of PG_reserved on !VM_IO vma's */
> +	printk(KERN_ERR "get_user_pages PG_reserved page on"
> +			"vma:%p flags:%lx page:%d\n", savevma,
> +			savevma->flags, s);
> +	BUG();
> +	i = -EFAULT;
>  	goto out;
>  }

Yes, however I wouldn't turn on the debugging code just in case some
driver forgets to set VM_IO and it doesn't use remap_page_range. There's
nothing fundamentally fatal in having a reserved page in a non VM_IO
vma (I mean, after fixing the above bit ;).

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-11-25 19:45             ` Marcelo Tosatti
@ 2004-11-26  1:04               ` Andrea Arcangeli
  2004-11-30  4:03                 ` David S. Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2004-11-26  1:04 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Alan Cox, bgagnon, Linux Kernel Mailing List, Andrea Arcangeli, davem

On Thu, Nov 25, 2004 at 05:45:09PM -0200, Marcelo Tosatti wrote:
> Oh the VM_IO enforcement has been there for ages.

VM_IO is there for ages, but it wasn't used by all drivers, this is why
we got the problem... ;)

> Sure, I'll comment the BUG() off during 2.4.29-rc.
> 
> How does that sound?

Sounds good ;)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-11-26  1:04               ` Andrea Arcangeli
@ 2004-11-30  4:03                 ` David S. Miller
  2004-11-30  4:16                   ` Andrea Arcangeli
  0 siblings, 1 reply; 19+ messages in thread
From: David S. Miller @ 2004-11-30  4:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: marcelo.tosatti, alan, bgagnon, linux-kernel, andrea, davem


These changes have made 2.4.29-BK stop booting on sparc64.
I'll get more information to find out exactly why.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-11-30  4:03                 ` David S. Miller
@ 2004-11-30  4:16                   ` Andrea Arcangeli
  2004-11-30  6:11                     ` David S. Miller
  2004-11-30  6:19                     ` David S. Miller
  0 siblings, 2 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2004-11-30  4:16 UTC (permalink / raw)
  To: David S. Miller; +Cc: marcelo.tosatti, alan, bgagnon, linux-kernel, davem

On Mon, Nov 29, 2004 at 08:03:31PM -0800, David S. Miller wrote:
> 
> These changes have made 2.4.29-BK stop booting on sparc64.
> I'll get more information to find out exactly why.

hmm strange, I would suggest to put a dump_stack here to see where it
triggers:

                               if (!pages[i] || PageReserved(pages[i])) {
                                       if (pages[i] != ZERO_PAGE(start)) {
+					       dump_stack()
                                               savevma = vma;
                                               goto bad_page;   
                                       }

                               } else

The only thing I see that's probably missing for sparc is a
flush_dcache_page page_cache_get, but that wouldn't prevent booting,
it'd only corrupt userland as worse (and it was missing from the
previous code too ;).

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-11-30  4:16                   ` Andrea Arcangeli
@ 2004-11-30  6:11                     ` David S. Miller
  2004-11-30  6:19                     ` David S. Miller
  1 sibling, 0 replies; 19+ messages in thread
From: David S. Miller @ 2004-11-30  6:11 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: marcelo.tosatti, alan, bgagnon, linux-kernel, davem

On Tue, 30 Nov 2004 05:16:56 +0100
Andrea Arcangeli <andrea@suse.de> wrote:

> On Mon, Nov 29, 2004 at 08:03:31PM -0800, David S. Miller wrote:
> > 
> > These changes have made 2.4.29-BK stop booting on sparc64.
> > I'll get more information to find out exactly why.
> 
> hmm strange, I would suggest to put a dump_stack here to see where it
> triggers:
> 
>                                if (!pages[i] || PageReserved(pages[i])) {
>                                        if (pages[i] != ZERO_PAGE(start)) {
> +					       dump_stack()
>                                                savevma = vma;
>                                                goto bad_page;   
>                                        }
> 
>                                } else
> 
> The only thing I see that's probably missing for sparc is a
> flush_dcache_page page_cache_get, but that wouldn't prevent booting,
> it'd only corrupt userland as worse (and it was missing from the
> previous code too ;).

It's happening very early on, when we boot the second processor,
which means that something wrt. forking the first kernel thread
is perhaps being messed up.

I can't see how get_user_pages() could be involved there, so like I
said I'll try to get some more info.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-11-30  4:16                   ` Andrea Arcangeli
  2004-11-30  6:11                     ` David S. Miller
@ 2004-11-30  6:19                     ` David S. Miller
  1 sibling, 0 replies; 19+ messages in thread
From: David S. Miller @ 2004-11-30  6:19 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: marcelo.tosatti, alan, bgagnon, linux-kernel, davem

On Tue, 30 Nov 2004 05:16:56 +0100
Andrea Arcangeli <andrea@suse.de> wrote:

> hmm strange, I would suggest to put a dump_stack here to see where it
> triggers:

Nevermind everyone, it's not the get_user_pages() changes.
It's something else happening in 2.4.29-pre1 which is buggering
up sparc64.

False alarm, sorry :-)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
  2004-10-21 13:39 O.Sezer
@ 2004-10-21 14:26 ` Andrea Arcangeli
  0 siblings, 0 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2004-10-21 14:26 UTC (permalink / raw)
  To: O.Sezer; +Cc: linux-kernel

On Thu, Oct 21, 2004 at 04:39:09PM +0300, O.Sezer wrote:
> I can't find to which suse kernel these patch(es) apply. I assume
> your first one comes down to the attached one-liner for vanilla-2.4,
> can you confirm?

yes.

> For your second: I think it needs your 9999_z-get_user_pages_pte_pin-1
> patch applied beforehand?. Without that patch, are there any problems

... plus yet another incremental patch not yet in 2.4.23aa3.  I'll
upload an aa4 soon with all of this included.

> to be fixed? Can you post patches for vanilla kernels, please?

I'll upload an aa4, then the future 9999_z-get_user_pages_pte_pin-2 will
included every known fix to get_user_pages. I hope it's the same for
you. I've a set of incremental fixes, and I'll join all of them together
in 9999_z-get_user_pages_pte_pin-2 (the PageReserved fix will get mixed
into it too, I hope that's not too confusing, the idea is that such
patch will fix all issues in get_user_pages in 2.4).

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
@ 2004-10-21 13:39 O.Sezer
  2004-10-21 14:26 ` Andrea Arcangeli
  0 siblings, 1 reply; 19+ messages in thread
From: O.Sezer @ 2004-10-21 13:39 UTC (permalink / raw)
  To: andrea; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4304 bytes --]

Andrea Arcangeli  wrote:
>> > > That isnt sufficient. Consider anything else taking a reference to the
>> > > page and the refcount going negative. 
>> > 
>> > You mean not going negative? The problem here as I understand here is 
>> > we dont release the count if the PageReserved is set, but we should.
>> 
>> Drivers like the OSS audio drivers move page between Reserved and 
>> unreserved. The count can thus be corrupted.
> 
> the PG_reserved goes away after VM_IO, so forbidding pages with
> PG_reserved of vmas with VM_IO isn't any different as far as I can tell,
> and since PG_reserved is the real offender sure we shouldn't leave a
> check in get_user_pages that explicitly do something if the page is
> reserved, since if the page is reserved at that point we'd need to
> return -EFAULT or BUG_ON.
> 
> Adding the VM_IO patch on top of this is sure a good idea.
> 
> --- sles/mm/memory.c.~1~	2004-10-19 19:34:10.264335488 +0200
> +++ sles/mm/memory.c	2004-10-19 19:58:47.403776160 +0200
> @@ -806,7 +806,7 @@ int get_user_pages(struct task_struct *t
>  			}
>  			if (pages) {
>  				pages[i] = get_page_map(map);
> -				if (!pages[i]) {
> +				if (!pages[i] || PageReserved(pages[i])) {
>  					spin_unlock(&mm->page_table_lock);
>  					while (i--)
>  						page_cache_release(pages[i]);
> @@ -814,8 +814,7 @@ int get_user_pages(struct task_struct *t
>  					goto out;
>  				}
>  				flush_dcache_page(pages[i]);
> -				if (!PageReserved(pages[i]))
> -					page_cache_get(pages[i]);
> +				page_cache_get(pages[i]);
>  			}
>  			if (vmas)
>  				vmas[i] = vma;
> 
> My version of the fix for 2.4 is this, but this fixes as well an issue
> with the zeropage and it's on top of some other fix for COW corruption
> in 2.4 not yet fixed in mainline 2.4. Since 2.4 never checked
> PageReserved like 2.6 does in get_user_pages, 2.4 as worse can suffer a
> memleak.
> 
> --- sles/include/linux/mm.h.~1~	2004-10-18 10:20:53.391823696 +0200
> +++ sles/include/linux/mm.h	2004-10-18 10:47:10.861011928 +0200
> @@ -533,9 +533,8 @@ extern void unpin_pte_page(struct page *
>  
>  static inline void put_user_page_pte_pin(struct page * page)
>  {
> -	if (PagePinned(page))
> -		/* must run before put_page, put_page may free the page */
> -		unpin_pte_page(page);
> +	/* must run before put_page, put_page may free the page */
> +	unpin_pte_page(page);
>  
>  	put_page(page);
>  }
> --- sles/mm/memory.c.~1~	2004-10-18 10:20:54.947587184 +0200
> +++ sles/mm/memory.c	2004-10-18 10:47:49.822088944 +0200
> @@ -530,7 +530,11 @@ void __wait_on_pte_pinned_page(struct pa
>  
>  void unpin_pte_page(struct page *page)
>  {
> -	wait_queue_head_t *waitqueue = page_waitqueue(page);
> +	wait_queue_head_t *waitqueue;
> +
> +	if (!PagePinned(page))
> +		return;
> +	waitqueue = page_waitqueue(page);
>  	if (unlikely(!TestClearPagePinned(page)))
>  		BUG();
>  	smp_mb__after_clear_bit(); 
> @@ -598,17 +602,21 @@ int __get_user_pages(struct task_struct 
>  				 */
>  				if (!map)
>  					goto bad_page;
> -				page_cache_get(map);
> -				if (pte_pin && unlikely(TestSetPagePinned(map))) {
> -					/* fail if this is a duplicate physical page in this kiovec */
> -					int i2 = i;
> -					while (i2--)
> -						if (map == pages[i2]) {
> -							put_page(map);
> -							goto bad_page;
> -						}
> -					/* hold a reference on "map" so we can wait on it */
> -					goto pte_pin_collision;
> +				if (map != ZERO_PAGE(start)) {
> +					if (PageReserved(map))
> +						goto bad_page;
> +					page_cache_get(map);
> +					if (pte_pin && unlikely(TestSetPagePinned(map))) {
> +						/* fail if this is a duplicate physical page in this kiovec */
> +						int i2 = i;
> +						while (i2--)
> +							if (map == pages[i2]) {
> +								put_page(map);
> +								goto bad_page;
> +							}
> +						/* hold a reference on "map" so we can wait on it */
> +						goto pte_pin_collision;
> +					}
>  				}
>  				pages[i] = map;
>  			}

I can't find to which suse kernel these patch(es) apply. I assume
your first one comes down to the attached one-liner for vanilla-2.4,
can you confirm?
For your second: I think it needs your 9999_z-get_user_pages_pte_pin-1
patch applied beforehand?. Without that patch, are there any problems
to be fixed? Can you post patches for vanilla kernels, please?

Regards,
Ozkan Sezer


[-- Attachment #2: 2.4_memory.c-PageReserved.diff --]
[-- Type: text/plain, Size: 360 bytes --]

--- 2.4/mm/memory.c.BAK	2004-10-20 11:49:35.000000000 +0300
+++ 2.4/mm/memory.c	2004-10-21 10:43:01.000000000 +0300
@@ -499,7 +499,7 @@
 				/* FIXME: call the correct function,
 				 * depending on the type of the found page
 				 */
-				if (!pages[i])
+				if (!pages[i] || PageReserved(pages[i]))
 					goto bad_page;
 				page_cache_get(pages[i]);
 			}


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2004-11-30  6:21 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-10-14 14:50 Memory leak in 2.4.27 kernel, using mmap raw packet sockets bgagnon
2004-10-15 18:23 ` Marcelo Tosatti
2004-10-17  2:39   ` Alan Cox
2004-10-19 14:35     ` Marcelo Tosatti
2004-10-20 18:43       ` Alan Cox
2004-10-20 23:24         ` Andrea Arcangeli
2004-10-23 14:17           ` Marcelo Tosatti
2004-11-25 15:02     ` Marcelo Tosatti
2004-11-25 20:32       ` Andrea Arcangeli
2004-11-25 17:12         ` Marcelo Tosatti
2004-11-25 23:13           ` Andrea Arcangeli
2004-11-25 19:45             ` Marcelo Tosatti
2004-11-26  1:04               ` Andrea Arcangeli
2004-11-30  4:03                 ` David S. Miller
2004-11-30  4:16                   ` Andrea Arcangeli
2004-11-30  6:11                     ` David S. Miller
2004-11-30  6:19                     ` David S. Miller
2004-10-21 13:39 O.Sezer
2004-10-21 14:26 ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).