* Fix refcounting in hugetlbfs quota handling
@ 2011-08-11  6:40 David Gibson
  2011-08-12  0:48 ` Linus Torvalds
  2011-08-12 22:20 ` [Qemu-devel] " Christoph Hellwig
From: David Gibson @ 2011-08-11  6:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Marcelo Tosatti, Jan Kiszka, qemu-devel, agraf, kvm,
	Paul Mackerras, linux-kernel

Linus, please apply

hugetlbfs tracks the current usage of hugepages per hugetlbfs
mountpoint.  To correctly track this when hugepages are released, it
must find the right hugetlbfs super_block from the struct page
available in free_huge_page().

It does this by storing a pointer to the hugepage's struct
address_space in the page_private data.  The hugetlb_{get,put}_quota
functions go from this mapping to the inode and thence to the
super_block.
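
For reference, the pre-patch lookup chain is roughly this (an
illustrative sketch, not part of the patch itself):

	struct address_space *mapping =
		(struct address_space *) page_private(page);
	struct hugetlbfs_sb_info *sbinfo =
		HUGETLBFS_SB(mapping->host->i_sb);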

However, this usage is buggy, because nothing ensures that the
address_space is not freed before all the hugepages that belonged to
it are.  In practice that will usually be the case, but if extra page
references have been taken by e.g. drivers or kvm doing
get_user_pages() then the file, inode and address space may be
destroyed before all the pages.

In addition, the quota functions use the mapping only to get the inode
then the super_block.  However, most of the callers already have the
inode anyway and have to get the mapping from there.

This patch, therefore, stores a pointer to the inode instead of the
address_space in the page private data for hugepages.  More
importantly it correctly adjusts the reference count on the inodes
when they're added to the page private data.  This ensures that the
inode (and therefore the super block) will not be freed before we use
it from free_huge_page.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

Index: working-2.6/fs/hugetlbfs/inode.c
===================================================================
--- working-2.6.orig/fs/hugetlbfs/inode.c	2011-08-10 16:45:47.864758406 +1000
+++ working-2.6/fs/hugetlbfs/inode.c	2011-08-10 17:22:21.899638039 +1000
@@ -874,10 +874,10 @@ out_free:
 	return -ENOMEM;
 }
 
-int hugetlb_get_quota(struct address_space *mapping, long delta)
+int hugetlb_get_quota(struct inode *inode, long delta)
 {
 	int ret = 0;
-	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
+	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(inode->i_sb);
 
 	if (sbinfo->free_blocks > -1) {
 		spin_lock(&sbinfo->stat_lock);
@@ -891,9 +891,9 @@ int hugetlb_get_quota(struct address_spa
 	return ret;
 }
 
-void hugetlb_put_quota(struct address_space *mapping, long delta)
+void hugetlb_put_quota(struct inode *inode, long delta)
 {
-	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
+	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(inode->i_sb);
 
 	if (sbinfo->free_blocks > -1) {
 		spin_lock(&sbinfo->stat_lock);
Index: working-2.6/include/linux/hugetlb.h
===================================================================
--- working-2.6.orig/include/linux/hugetlb.h	2011-08-10 16:58:27.952527484 +1000
+++ working-2.6/include/linux/hugetlb.h	2011-08-10 17:22:08.723572707 +1000
@@ -171,8 +171,8 @@ extern const struct file_operations huge
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
 				struct user_struct **user, int creat_flags);
-int hugetlb_get_quota(struct address_space *mapping, long delta);
-void hugetlb_put_quota(struct address_space *mapping, long delta);
+int hugetlb_get_quota(struct inode *inode, long delta);
+void hugetlb_put_quota(struct inode *inode, long delta);
 
 static inline int is_file_hugepages(struct file *file)
 {
Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c	2011-08-10 16:44:12.212284092 +1000
+++ working-2.6/mm/hugetlb.c	2011-08-10 17:21:49.603477888 +1000
@@ -533,10 +533,12 @@ static void free_huge_page(struct page *
 	 */
 	struct hstate *h = page_hstate(page);
 	int nid = page_to_nid(page);
-	struct address_space *mapping;
+	struct inode *inode;
 
-	mapping = (struct address_space *) page_private(page);
+	inode = (struct inode *) page_private(page);
 	set_page_private(page, 0);
+	iput(inode);
+
 	page->mapping = NULL;
 	BUG_ON(page_count(page));
 	BUG_ON(page_mapcount(page));
@@ -551,8 +553,8 @@ static void free_huge_page(struct page *
 		enqueue_huge_page(h, page);
 	}
 	spin_unlock(&hugetlb_lock);
-	if (mapping)
-		hugetlb_put_quota(mapping, 1);
+	if (inode)
+		hugetlb_put_quota(inode, 1);
 }
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -1035,7 +1037,7 @@ static struct page *alloc_huge_page(stru
 	if (chg < 0)
 		return ERR_PTR(-VM_FAULT_OOM);
 	if (chg)
-		if (hugetlb_get_quota(inode->i_mapping, chg))
+		if (hugetlb_get_quota(inode, chg))
 			return ERR_PTR(-VM_FAULT_SIGBUS);
 
 	spin_lock(&hugetlb_lock);
@@ -1045,12 +1047,12 @@ static struct page *alloc_huge_page(stru
 	if (!page) {
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
-			hugetlb_put_quota(inode->i_mapping, chg);
+			hugetlb_put_quota(inode, chg);
 			return ERR_PTR(-VM_FAULT_SIGBUS);
 		}
 	}
 
-	set_page_private(page, (unsigned long) mapping);
+	set_page_private(page, (unsigned long) igrab(inode));
 
 	vma_commit_reservation(h, vma, addr);
 
@@ -2086,7 +2088,8 @@ static void hugetlb_vm_op_close(struct v
 
 		if (reserve) {
 			hugetlb_acct_memory(h, -reserve);
-			hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
+			hugetlb_put_quota(vma->vm_file->f_mapping->host,
+					  reserve);
 		}
 	}
 }
@@ -2884,7 +2887,7 @@ int hugetlb_reserve_pages(struct inode *
 		return chg;
 
 	/* There must be enough filesystem quota for the mapping */
-	if (hugetlb_get_quota(inode->i_mapping, chg))
+	if (hugetlb_get_quota(inode, chg))
 		return -ENOSPC;
 
 	/*
@@ -2893,7 +2896,7 @@ int hugetlb_reserve_pages(struct inode *
 	 */
 	ret = hugetlb_acct_memory(h, chg);
 	if (ret < 0) {
-		hugetlb_put_quota(inode->i_mapping, chg);
+		hugetlb_put_quota(inode, chg);
 		return ret;
 	}
 
@@ -2922,7 +2925,7 @@ void hugetlb_unreserve_pages(struct inod
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);
 
-	hugetlb_put_quota(inode->i_mapping, (chg - freed));
+	hugetlb_put_quota(inode, (chg - freed));
 	hugetlb_acct_memory(h, -(chg - freed));
 }
 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: Fix refcounting in hugetlbfs quota handling
  2011-08-11  6:40 Fix refcounting in hugetlbfs quota handling David Gibson
@ 2011-08-12  0:48 ` Linus Torvalds
  2011-08-12  4:34   ` Minchan Kim
  2011-08-12 22:20 ` [Qemu-devel] " Christoph Hellwig
From: Linus Torvalds @ 2011-08-12  0:48 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti, Jan Kiszka, qemu-devel, agraf, kvm,
	Paul Mackerras, linux-kernel, Hugh Dickins, Minchan Kim,
	KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki

On Wed, Aug 10, 2011 at 11:40 PM, David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> This patch, therefore, stores a pointer to the inode instead of the
> address_space in the page private data for hugepages.  More
> importantly it correctly adjusts the reference count on the inodes
> when they're added to the page private data.  This ensures that the
> inode (and therefore the super block) will not be freed before we use
> it from free_huge_page.

Looks sane, but I *really* want some acks from people who use/know
hugetlbfs. Who would that be? I'm adding random people who have
acked/signed-off patches to hugetlbfs recently..

             Linus

--- patch left quoted for new people ---
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>
> Index: working-2.6/fs/hugetlbfs/inode.c
> ===================================================================
> --- working-2.6.orig/fs/hugetlbfs/inode.c       2011-08-10 16:45:47.864758406 +1000
> +++ working-2.6/fs/hugetlbfs/inode.c    2011-08-10 17:22:21.899638039 +1000
> @@ -874,10 +874,10 @@ out_free:
>        return -ENOMEM;
>  }
>
> -int hugetlb_get_quota(struct address_space *mapping, long delta)
> +int hugetlb_get_quota(struct inode *inode, long delta)
>  {
>        int ret = 0;
> -       struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
> +       struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(inode->i_sb);
>
>        if (sbinfo->free_blocks > -1) {
>                spin_lock(&sbinfo->stat_lock);
> @@ -891,9 +891,9 @@ int hugetlb_get_quota(struct address_spa
>        return ret;
>  }
>
> -void hugetlb_put_quota(struct address_space *mapping, long delta)
> +void hugetlb_put_quota(struct inode *inode, long delta)
>  {
> -       struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
> +       struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(inode->i_sb);
>
>        if (sbinfo->free_blocks > -1) {
>                spin_lock(&sbinfo->stat_lock);
> Index: working-2.6/include/linux/hugetlb.h
> ===================================================================
> --- working-2.6.orig/include/linux/hugetlb.h    2011-08-10 16:58:27.952527484 +1000
> +++ working-2.6/include/linux/hugetlb.h 2011-08-10 17:22:08.723572707 +1000
> @@ -171,8 +171,8 @@ extern const struct file_operations huge
>  extern const struct vm_operations_struct hugetlb_vm_ops;
>  struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
>                                struct user_struct **user, int creat_flags);
> -int hugetlb_get_quota(struct address_space *mapping, long delta);
> -void hugetlb_put_quota(struct address_space *mapping, long delta);
> +int hugetlb_get_quota(struct inode *inode, long delta);
> +void hugetlb_put_quota(struct inode *inode, long delta);
>
>  static inline int is_file_hugepages(struct file *file)
>  {
> Index: working-2.6/mm/hugetlb.c
> ===================================================================
> --- working-2.6.orig/mm/hugetlb.c       2011-08-10 16:44:12.212284092 +1000
> +++ working-2.6/mm/hugetlb.c    2011-08-10 17:21:49.603477888 +1000
> @@ -533,10 +533,12 @@ static void free_huge_page(struct page *
>         */
>        struct hstate *h = page_hstate(page);
>        int nid = page_to_nid(page);
> -       struct address_space *mapping;
> +       struct inode *inode;
>
> -       mapping = (struct address_space *) page_private(page);
> +       inode = (struct inode *) page_private(page);
>        set_page_private(page, 0);
> +       iput(inode);
> +
>        page->mapping = NULL;
>        BUG_ON(page_count(page));
>        BUG_ON(page_mapcount(page));
> @@ -551,8 +553,8 @@ static void free_huge_page(struct page *
>                enqueue_huge_page(h, page);
>        }
>        spin_unlock(&hugetlb_lock);
> -       if (mapping)
> -               hugetlb_put_quota(mapping, 1);
> +       if (inode)
> +               hugetlb_put_quota(inode, 1);
>  }
>
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> @@ -1035,7 +1037,7 @@ static struct page *alloc_huge_page(stru
>        if (chg < 0)
>                return ERR_PTR(-VM_FAULT_OOM);
>        if (chg)
> -               if (hugetlb_get_quota(inode->i_mapping, chg))
> +               if (hugetlb_get_quota(inode, chg))
>                        return ERR_PTR(-VM_FAULT_SIGBUS);
>
>        spin_lock(&hugetlb_lock);
> @@ -1045,12 +1047,12 @@ static struct page *alloc_huge_page(stru
>        if (!page) {
>                page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
>                if (!page) {
> -                       hugetlb_put_quota(inode->i_mapping, chg);
> +                       hugetlb_put_quota(inode, chg);
>                        return ERR_PTR(-VM_FAULT_SIGBUS);
>                }
>        }
>
> -       set_page_private(page, (unsigned long) mapping);
> +       set_page_private(page, (unsigned long) igrab(inode));
>
>        vma_commit_reservation(h, vma, addr);
>
> @@ -2086,7 +2088,8 @@ static void hugetlb_vm_op_close(struct v
>
>                if (reserve) {
>                        hugetlb_acct_memory(h, -reserve);
> -                       hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
> +                       hugetlb_put_quota(vma->vm_file->f_mapping->host,
> +                                         reserve);
>                }
>        }
>  }
> @@ -2884,7 +2887,7 @@ int hugetlb_reserve_pages(struct inode *
>                return chg;
>
>        /* There must be enough filesystem quota for the mapping */
> -       if (hugetlb_get_quota(inode->i_mapping, chg))
> +       if (hugetlb_get_quota(inode, chg))
>                return -ENOSPC;
>
>        /*
> @@ -2893,7 +2896,7 @@ int hugetlb_reserve_pages(struct inode *
>         */
>        ret = hugetlb_acct_memory(h, chg);
>        if (ret < 0) {
> -               hugetlb_put_quota(inode->i_mapping, chg);
> +               hugetlb_put_quota(inode, chg);
>                return ret;
>        }
>
> @@ -2922,7 +2925,7 @@ void hugetlb_unreserve_pages(struct inod
>        inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>        spin_unlock(&inode->i_lock);
>
> -       hugetlb_put_quota(inode->i_mapping, (chg - freed));
> +       hugetlb_put_quota(inode, (chg - freed));
>        hugetlb_acct_memory(h, -(chg - freed));
>  }
>
>
> --
> David Gibson                    | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
>                                | _way_ _around_!
> http://www.ozlabs.org/~dgibson


* Re: Fix refcounting in hugetlbfs quota handling
  2011-08-12  0:48 ` Linus Torvalds
@ 2011-08-12  4:34   ` Minchan Kim
  2011-08-12 19:15     ` Hugh Dickins
From: Minchan Kim @ 2011-08-12  4:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Marcelo Tosatti, Jan Kiszka, qemu-devel, agraf, kvm,
	Paul Mackerras, linux-kernel, Hugh Dickins, KOSAKI Motohiro,
	Andrew Morton, KAMEZAWA Hiroyuki, Mel Gorman, Andrew Hastings,
	Adam Litke

On Fri, Aug 12, 2011 at 9:48 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Aug 10, 2011 at 11:40 PM, David Gibson
> <david@gibson.dropbear.id.au> wrote:
>>
>> This patch, therefore, stores a pointer to the inode instead of the
>> address_space in the page private data for hugepages.  More
>> importantly it correctly adjusts the reference count on the inodes
>> when they're added to the page private data.  This ensures that the
>> inode (and therefore the super block) will not be freed before we use
>> it from free_huge_page.
>
> Looks sane, but I *really* want some acks from people who use/know
> hugetlbfs. Who would that be? I'm adding random people who have
> acked/signed-off patches to hugetlbfs recently..

At least, the code itself looks good to me, but your random choice failed.
Maybe the people you want are as follows:
http://marc.info/?t=126928975800003&r=1&w=2

Cc'ing the right people.

-- 
Kind regards,
Minchan Kim


* Re: Fix refcounting in hugetlbfs quota handling
  2011-08-12  4:34   ` Minchan Kim
@ 2011-08-12 19:15     ` Hugh Dickins
  2011-08-13  1:08       ` David Gibson
From: Hugh Dickins @ 2011-08-12 19:15 UTC (permalink / raw)
  To: David Gibson
  Cc: Linus Torvalds, Avi Kivity, Marcelo Tosatti, Jan Kiszka,
	qemu-devel, agraf, kvm, Paul Mackerras, linux-kernel,
	KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Mel Gorman,
	Andrew Hastings, Adam Litke, Minchan Kim


On Fri, 12 Aug 2011, Minchan Kim wrote:
> On Fri, Aug 12, 2011 at 9:48 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Wed, Aug 10, 2011 at 11:40 PM, David Gibson
> > <david@gibson.dropbear.id.au> wrote:
> >>
> >> This patch, therefore, stores a pointer to the inode instead of the
> >> address_space in the page private data for hugepages.  More
> >> importantly it correctly adjusts the reference count on the inodes
> >> when they're added to the page private data.  This ensures that the
> >> inode (and therefore the super block) will not be freed before we use
> >> it from free_huge_page.
> >
> > Looks sane, but I *really* want some acks from people who use/know
> > hugetlbfs. Who would that be? I'm adding random people who have
> > acked/signed-off patches to hugetlbfs recently..
> 
> At least, code itself looks good to me but your random choice was failed.
> Maybe people you want are as follows.
> http://marc.info/?t=126928975800003&r=1&w=2
> 
> Ccing right persons.

I don't know much about hugetlbfs these days, but I think the patch
is very wrong.

The real change is where alloc_huge_page() does igrab(inode) and
free_huge_page() does iput(inode)?

That makes me very nervous, partly because a final iput() is a complex
operation, which we wouldn't expect to be doing when "freeing" a page.

My first worry was that free_huge_page() could actually get called at
interrupt time (when it's in a pagevec of pages to be freed as a batch,
then another put_page is done at interrupt time which frees that batch):
I worried that we use spin_lock not spin_lock_irqsave on inode->i_lock.
To be honest though, I've not followed up whether that's actually a
possibility, the compound page path is too twisty for a quick answer;
and even if it's a possibility, it's one that's already ignored in the
case of hugetlb_lock.
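
(If it did turn out to be a possibility, the i_lock users on that path
would need the irq-safe locking form, something along these lines:

	unsigned long flags;

	spin_lock_irqsave(&inode->i_lock, flags);
	/* ... touch inode state ... */
	spin_unlock_irqrestore(&inode->i_lock, flags);

rather than plain spin_lock()/spin_unlock().)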

Setting that aside, I think this thing of grabbing a reference to inode
for each page just does not work as you wish: when we unlink an inode,
all its pages should be freed; but because they are themselves holding
references to the inode, it and its pages stick around forever.

A quick experiment with your patch versus without confirmed that:
meminfo HugePages_Free stayed down with your patch, but went back to
HugePages_Total without it.  Please check, perhaps I'm just mistaken.

Sorry, I've not looked into what a constructive alternative might be;
and it's not the first time we've had this difficulty - it came up last
year when the ->freepage function was added, that the inode may be gone
by the time ->freepage(page) is called.

On a side note, very good description - thank you, but I wish you'd
split the patch into two, the fix and then the inode-instead-of-mapping
cleanup.  Though personally I'd prefer not to make that "cleanup": it's
normal for a struct address_space * to be used in struct page (if I delved
I guess I'd find good reason why this one is in page->private instead of
page->mapping: perhaps because it's needed after page->mapping is reset
to NULL, perhaps because it's needed on COWed copies of hugetlbfs pages).

Hugh


* Re: [Qemu-devel] Fix refcounting in hugetlbfs quota handling
  2011-08-11  6:40 Fix refcounting in hugetlbfs quota handling David Gibson
  2011-08-12  0:48 ` Linus Torvalds
@ 2011-08-12 22:20 ` Christoph Hellwig
From: Christoph Hellwig @ 2011-08-12 22:20 UTC (permalink / raw)
  To: Linus Torvalds, Avi Kivity, Marcelo Tosatti, Jan Kiszka,
	qemu-devel, agraf, kvm, Paul Mackerras, linux-kernel

On Thu, Aug 11, 2011 at 04:40:59PM +1000, David Gibson wrote:
> Linus, please apply
> 
> hugetlbfs tracks the current usage of hugepages per hugetlbfs
> mountpoint.  To correctly track this when hugepages are released, it
> must find the right hugetlbfs super_block from the struct page
> available in free_huge_page().

A superblock is not a mountpoint; it's a filesystem instance.  You can happily
have a single filesystem mounted at multiple mount points.

> However, this usage is buggy, because nothing ensures that the
> address_space is not freed before all the hugepages that belonged to
> it are.  In practice that will usually be the case, but if extra page
> references have been taken by e.g. drivers or kvm doing
> get_user_pages() then the file, inode and address space may be
> destroyed before all the pages.
> 
> In addition, the quota functions use the mapping only to get the inode
> then the super_block.  However, most of the callers already have the
> inode anyway and have to get the mapping from there.
> 
> This patch, therefore, stores a pointer to the inode instead of the
> address_space in the page private data for hugepages.

What's the point?  The lifetime of inode->i_mapping is exactly the
same as that of the inode, except for those few filesystems that use
one from a different inode (and then for the whole lifetime of the
inode), so I can't see how your patch will make a difference.
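
(For almost every filesystem the mapping is simply embedded in the inode,
roughly:

	inode->i_mapping = &inode->i_data;
	inode->i_data.host = inode;

so the two objects share a lifetime by construction.)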

> More
> importantly it correctly adjusts the reference count on the inodes
> when they're added to the page private data.  This ensures that the
> inode (and therefore the super block) will not be freed before we use
> it from free_huge_page.

That seems like the real fix.  And even if you'd still do the other bits
it should be a separate patch.



* Re: Fix refcounting in hugetlbfs quota handling
  2011-08-12 19:15     ` Hugh Dickins
@ 2011-08-13  1:08       ` David Gibson
  2011-08-15 18:00         ` Hugh Dickins
From: David Gibson @ 2011-08-13  1:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Avi Kivity, Marcelo Tosatti, Jan Kiszka,
	qemu-devel, agraf, kvm, Paul Mackerras, linux-kernel,
	KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Mel Gorman,
	Andrew Hastings, Adam Litke, Minchan Kim

On Fri, Aug 12, 2011 at 12:15:21PM -0700, Hugh Dickins wrote:
> On Fri, 12 Aug 2011, Minchan Kim wrote:
> > On Fri, Aug 12, 2011 at 9:48 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > On Wed, Aug 10, 2011 at 11:40 PM, David Gibson
> > > <david@gibson.dropbear.id.au> wrote:
> > >>
> > >> This patch, therefore, stores a pointer to the inode instead of the
> > >> address_space in the page private data for hugepages.  More
> > >> importantly it correctly adjusts the reference count on the inodes
> > >> when they're added to the page private data.  This ensures that the
> > >> inode (and therefore the super block) will not be freed before we use
> > >> it from free_huge_page.
> > >
> > > Looks sane, but I *really* want some acks from people who use/know
> > > hugetlbfs. Who would that be? I'm adding random people who have
> > > acked/signed-off patches to hugetlbfs recently..
> > 
> > At least, code itself looks good to me but your random choice was failed.
> > Maybe people you want are as follows.
> > http://marc.info/?t=126928975800003&r=1&w=2
> > 
> > Ccing right persons.
> 
> I don't know much about hugetlbfs these days, but I think the patch
> is very wrong.
> 
> The real change is where alloc_huge_page() does igrab(inode) and
> free_huge_pages() does iput(inode)?
> 
> That makes me very nervous, partly because a final iput() is a complex
> operation, which we wouldn't expect to be doing when "freeing" a page.
> 
> My first worry was that free_huge_page() could actually get called at
> interrupt time (when it's in a pagevec of pages to be freed as a batch,
> then another put_page is done at interrupt time which frees that batch):
> I worried that we use spin_lock not spin_lock_irqsave on inode->i_lock.
> To be honest though, I've not followed up whether that's actually a
> possibility, the compound page path is too twisty for a quick answer;
> and even if it's a possibility, it's one that's already ignored in the
> case of hugetlb_lock.
> 
> Setting that aside, I think this thing of grabbing a reference to inode
> for each page just does not work as you wish: when we unlink an inode,
> all its pages should be freed; but because they are themselves holding
> references to the inode, it and its pages stick around forever.

Ugh, yes.  You're absolutely right.  That circular reference will mess
everything up.  Both my thinking it through and my testing failed to catch that.

> A quick experiment with your patch versus without confirmed that:
> meminfo HugePages_Free stayed down with your patch, but went back to
> HugePages_Total without it.  Please check, perhaps I'm just mistaken.
> 
> Sorry, I've not looked into what a constructive alternative might be;
> and it's not the first time we've had this difficulty - it came up last
> year when the ->freepage function was added, that the inode may be gone
> by the time ->freepage(page) is called.

Ok, so.  In fact the quota functions we call at free time only need
the super block, not the inode per se.  If we put a superblock pointer
instead of an inode pointer in page private, and refcounted that, I
think that should remove the circular ref.  The only reason I didn't
do it before is that the superblock refcounting functions didn't seem
to be globally visible in an obvious way.

Does that sound like a reasonable approach?
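
Roughly what I have in mind, with the superblock refcounting step
hand-waved and the quota helpers switched to take the sb (names here
are only illustrative, not a worked patch):

	/* at alloc_huge_page() time */
	set_page_private(page, (unsigned long) inode->i_sb);
	/* ... plus taking a superblock reference here somehow ... */

	/* at free_huge_page() time */
	struct super_block *sb = (struct super_block *) page_private(page);

	if (sb)
		hugetlb_put_quota(sb, 1);	/* hypothetical sb-based variant */
	/* ... plus dropping the superblock reference ... */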

> On a side note, very good description - thank you, but I wish you'd
> split the patch into two, the fix and then the inode-instead-of-mapping
> cleanup.  Though personally I'd prefer not to make that "cleanup": it's
> normal for a struct address space * to be used in struct page (if I delved
> I guess I'd find good reason why this one is in page->private instead of
> page->mapping: perhaps because it's needed after page->mapping is reset
> to NULL, perhaps because it's needed on COWed copies of hugetlbfs pages).

That is an interesting question.  But it doesn't address the basic
point: mappings aren't refcounted themselves, and as far as I can
tell their lifetime is bound to that of their inode.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: Fix refcounting in hugetlbfs quota handling
  2011-08-13  1:08       ` David Gibson
@ 2011-08-15 18:00         ` Hugh Dickins
  2011-08-15 20:25           ` Andrew Barry
From: Hugh Dickins @ 2011-08-15 18:00 UTC (permalink / raw)
  To: David Gibson
  Cc: David Gibson, Linus Torvalds, Avi Kivity, Marcelo Tosatti,
	Jan Kiszka, qemu-devel, agraf, kvm, Paul Mackerras, linux-kernel,
	KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Mel Gorman,
	Andrew Hastings, Adam Litke, Minchan Kim

On Sat, 13 Aug 2011, David Gibson wrote:
> On Fri, Aug 12, 2011 at 12:15:21PM -0700, Hugh Dickins wrote:
> > 
> > Setting that aside, I think this thing of grabbing a reference to inode
> > for each page just does not work as you wish: when we unlink an inode,
> > all its pages should be freed; but because they are themselves holding
> > references to the inode, it and its pages stick around forever.
> 
> Ugh, yes.  You're absolutely right.  That circular reference will mess
> everything up.  Thinking it through and testing fail.
> 
> > A quick experiment with your patch versus without confirmed that:
> > meminfo HugePages_Free stayed down with your patch, but went back to
> > HugePages_Total without it.  Please check, perhaps I'm just mistaken.
> > 
> > Sorry, I've not looked into what a constructive alternative might be;
> > and it's not the first time we've had this difficulty - it came up last
> > year when the ->freepage function was added, that the inode may be gone
> > by the time ->freepage(page) is called.
> 
> Ok, so.  In fact the quota functions we call at free time only need
> the super block, not the inode per se.  If we put a superblock pointer
> instead of an inode pointer in page private, and refcounted that, I
> think that should remove the circular ref.  The only reason I didn't
> do it before is that the superblock refcounting functions didn't seem
> to be globally visible in an obvious way.
> 
> Does that sound like a reasonable approach?

That does sound closer to a reasonable approach, but my guess is that it
will suck you into a world of superblock mutexes and semaphores, which
you cannot take at free_huge_page() time.

It might be necessary to hang your own tiny structure off the superblock,
with one refcount for the superblock, and one for each hugepage attached,
you freeing that structure when the count goes down to zero from either
direction.
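
A minimal sketch of that shape, with all of the names hypothetical:

	struct hugetlb_quota_holder {
		spinlock_t lock;
		long refs;	/* 1 for the superblock + 1 per attached hugepage */
		/* the block quota counters would move in here */
	};

	static void hugetlb_quota_holder_put(struct hugetlb_quota_holder *q)
	{
		bool last;

		spin_lock(&q->lock);
		last = (--q->refs == 0);
		spin_unlock(&q->lock);
		if (last)
			kfree(q);
	}

hugetlbfs_put_super() and free_huge_page() would then each drop one
reference through something like that, so whichever runs last frees the
structure.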

Whatever you do needs testing with lockdep and atomic sleep checks.

I do dislike tying these separate levels together in such an unusual way,
but it is a difficult problem and I don't know of an easy answer.  Maybe
we'll need to find a better answer for other reasons; it does come up
from time to time, e.g. the recent race between evicting an inode and its
nrpages going down to 0.

You might care to take a look at how tmpfs (mm/shmem.c) deals with
the equivalent issue there (sbinfo->used_blocks).  But I expect you to
conclude that hugetlbfs cannot afford the kind of approximations that
tmpfs can afford.

Although I think tmpfs is more correct, to be associating the count
with pagecache (so the count goes down as soon as a page is truncated
or evicted from pagecache), your fewer and huger pages, and reservation
conventions, may well demand that the count stays up until the page is
actually freed back to hugepool.  And let's not pretend that what tmpfs
does is wonderful: the strange shmem_recalc_inode() tries its best to
notice when memory pressure has freed clean pages, but it never looks
beyond the inode being accessed at the times it's called.  Not at all
satisfactory, but not actually an issue in practice, since we stopped
allocating pages for simple reads from sparse file.  I did want to
convert tmpfs to use ->freepage(), but couldn't manage it without
stable mapping - same problem as you have.

Hugh


* Re: Fix refcounting in hugetlbfs quota handling
  2011-08-15 18:00         ` Hugh Dickins
@ 2011-08-15 20:25           ` Andrew Barry
  2011-08-16  3:47             ` David Gibson
From: Andrew Barry @ 2011-08-15 20:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Gibson, David Gibson, Linus Torvalds, Avi Kivity,
	Marcelo Tosatti, Jan Kiszka, qemu-devel, agraf, kvm,
	Paul Mackerras, linux-kernel, KOSAKI Motohiro, Andrew Morton,
	KAMEZAWA Hiroyuki, Mel Gorman, Andrew Hastings, Adam Litke,
	Minchan Kim

I've been doing something similar to this last proposal. I put a
hugetlbfs_sb_info pointer into page_private, and dropped a reference counter and
an active/inactive bit into the hugetlbfs_sb_info struct. At umount time, the
sbinfo is freed only if the reference count is zero; otherwise, the last
put_quota frees the sbinfo structure. This fixed the race we were seeing between
umount and a put_quota from an RDMA transaction. I just gave it a cursory test
on a 3.0 kernel; it has seen quite a lot more testing on a 2.6.32-derived
kernel, with no more hits of the umount race.

Does this address the problems you were thinking about?
-Andrew Barry

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 87b6e04..2ed1cca 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -615,8 +615,12 @@ static void hugetlbfs_put_super(struct s
 	struct hugetlbfs_sb_info *sbi = HUGETLBFS_SB(sb);

 	if (sbi) {
+		sbi->active = HPAGE_INACTIVE;
 		sb->s_fs_info = NULL;
-		kfree(sbi);
+
+		/*Free only if used quota is zero. */
+		if (sbi->used_blocks == 0)
+			kfree(sbi);
 	}
 }

@@ -851,6 +855,8 @@ hugetlbfs_fill_super(struct super_block
 	sbinfo->free_blocks = config.nr_blocks;
 	sbinfo->max_inodes = config.nr_inodes;
 	sbinfo->free_inodes = config.nr_inodes;
+	sbinfo->used_blocks = 0;
+	sbinfo->active = HPAGE_ACTIVE;
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
 	sb->s_blocksize = huge_page_size(config.hstate);
 	sb->s_blocksize_bits = huge_page_shift(config.hstate);
@@ -874,30 +880,36 @@ out_free:
 	return -ENOMEM;
 }

-int hugetlb_get_quota(struct address_space *mapping, long delta)
+int hugetlb_get_quota(struct hugetlbfs_sb_info *sbinfo, long delta)
 {
 	int ret = 0;
-	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

-	if (sbinfo->free_blocks > -1) {
-		spin_lock(&sbinfo->stat_lock);
-		if (sbinfo->free_blocks - delta >= 0)
+	spin_lock(&sbinfo->stat_lock);
+	if ((sbinfo->free_blocks == -1) || (sbinfo->free_blocks - delta >= 0)) {
+		if (sbinfo->free_blocks != -1)
 			sbinfo->free_blocks -= delta;
-		else
-			ret = -ENOMEM;
-		spin_unlock(&sbinfo->stat_lock);
+		sbinfo->used_blocks += delta;
+		sbinfo->active = HPAGE_ACTIVE;
+	} else {
+		ret = -ENOMEM;
 	}
+	spin_unlock(&sbinfo->stat_lock);

 	return ret;
 }

-void hugetlb_put_quota(struct address_space *mapping, long delta)
+void hugetlb_put_quota(struct hugetlbfs_sb_info *sbinfo, long delta)
 {
-	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
-
-	if (sbinfo->free_blocks > -1) {
-		spin_lock(&sbinfo->stat_lock);
+	spin_lock(&sbinfo->stat_lock);
+	if (sbinfo->free_blocks > -1)
 		sbinfo->free_blocks += delta;
+	sbinfo->used_blocks -= delta;
+	/* If hugetlbfs_put_super couldn't free sbinfo due to
+	* an outstanding quota reference, free it now. */
+	if ((sbinfo->used_blocks == 0) && (sbinfo->active == HPAGE_INACTIVE)) {
+		spin_unlock(&sbinfo->stat_lock);
+		kfree(sbinfo);
+	} else {
 		spin_unlock(&sbinfo->stat_lock);
 	}
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 19644e0..8780a91 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -142,11 +142,16 @@ struct hugetlbfs_config {
 	struct hstate *hstate;
 };

+#define HPAGE_INACTIVE  0
+#define HPAGE_ACTIVE    1
+
 struct hugetlbfs_sb_info {
 	long	max_blocks;   /* blocks allowed */
 	long	free_blocks;  /* blocks free */
 	long	max_inodes;   /* inodes allowed */
 	long	free_inodes;  /* inodes free */
+	long	used_blocks;  /* blocks used */
+	long	active;		  /* active bit */
 	spinlock_t	stat_lock;
 	struct hstate *hstate;
 };
@@ -171,8 +176,8 @@ extern const struct file_operations huge
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
 				struct user_struct **user, int creat_flags);
-int hugetlb_get_quota(struct address_space *mapping, long delta);
-void hugetlb_put_quota(struct address_space *mapping, long delta);
+int hugetlb_get_quota(struct hugetlbfs_sb_info *sbinfo, long delta);
+void hugetlb_put_quota(struct hugetlbfs_sb_info *sbinfo, long delta);

 static inline int is_file_hugepages(struct file *file)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dae27ba..cf26ae9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -533,9 +533,9 @@ static void free_huge_page(struct page *
 	 */
 	struct hstate *h = page_hstate(page);
 	int nid = page_to_nid(page);
-	struct address_space *mapping;
+	struct hugetlbfs_sb_info *sbinfo;

-	mapping = (struct address_space *) page_private(page);
+	sbinfo = ( struct hugetlbfs_sb_info *) page_private(page);
 	set_page_private(page, 0);
 	page->mapping = NULL;
 	BUG_ON(page_count(page));
@@ -551,8 +551,8 @@ static void free_huge_page(struct page *
 		enqueue_huge_page(h, page);
 	}
 	spin_unlock(&hugetlb_lock);
-	if (mapping)
-		hugetlb_put_quota(mapping, 1);
+	if (sbinfo)
+		hugetlb_put_quota(sbinfo, 1);
 }

 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
@@ -1035,7 +1035,7 @@ static struct page *alloc_huge_page(stru
 	if (chg < 0)
 		return ERR_PTR(-VM_FAULT_OOM);
 	if (chg)
-		if (hugetlb_get_quota(inode->i_mapping, chg))
+		if (hugetlb_get_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg))
 			return ERR_PTR(-VM_FAULT_SIGBUS);

 	spin_lock(&hugetlb_lock);
@@ -1045,12 +1045,12 @@ static struct page *alloc_huge_page(stru
 	if (!page) {
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
-			hugetlb_put_quota(inode->i_mapping, chg);
+			hugetlb_put_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg);
 			return ERR_PTR(-VM_FAULT_SIGBUS);
 		}
 	}

-	set_page_private(page, (unsigned long) mapping);
+	set_page_private(page, (unsigned long) HUGETLBFS_SB(inode->i_mapping->host->i_sb));

 	vma_commit_reservation(h, vma, addr);

@@ -2086,7 +2086,7 @@ static void hugetlb_vm_op_close(struct v

 		if (reserve) {
 			hugetlb_acct_memory(h, -reserve);
-			hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
+			hugetlb_put_quota(HUGETLBFS_SB(vma->vm_file->f_mapping->host->i_sb), reserve);
 		}
 	}
 }
@@ -2884,7 +2884,7 @@ int hugetlb_reserve_pages(struct inode *
 		return chg;

 	/* There must be enough filesystem quota for the mapping */
-	if (hugetlb_get_quota(inode->i_mapping, chg))
+	if (hugetlb_get_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg))
 		return -ENOSPC;

 	/*
@@ -2893,7 +2893,7 @@ int hugetlb_reserve_pages(struct inode *
 	 */
 	ret = hugetlb_acct_memory(h, chg);
 	if (ret < 0) {
-		hugetlb_put_quota(inode->i_mapping, chg);
+		hugetlb_put_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg);
 		return ret;
 	}

@@ -2922,7 +2922,7 @@ void hugetlb_unreserve_pages(struct inod
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);

-	hugetlb_put_quota(inode->i_mapping, (chg - freed));
+	hugetlb_put_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), (chg - freed));
 	hugetlb_acct_memory(h, -(chg - freed));
 }



On 08/15/2011 01:00 PM, Hugh Dickins wrote:
> On Sat, 13 Aug 2011, David Gibson wrote:
>> On Fri, Aug 12, 2011 at 12:15:21PM -0700, Hugh Dickins wrote:
>>>
>>> Setting that aside, I think this thing of grabbing a reference to inode
>>> for each page just does not work as you wish: when we unlink an inode,
>>> all its pages should be freed; but because they are themselves holding
>>> references to the inode, it and its pages stick around forever.
>>
>> Ugh, yes.  You're absolutely right.  That circular reference will mess
>> everything up.  Thinking it through and testing fail.
>>
>>> A quick experiment with your patch versus without confirmed that:
>>> meminfo HugePages_Free stayed down with your patch, but went back to
>>> HugePages_Total without it.  Please check, perhaps I'm just mistaken.
>>>
>>> Sorry, I've not looked into what a constructive alternative might be;
>>> and it's not the first time we've had this difficulty - it came up last
>>> year when the ->freepage function was added, that the inode may be gone
>>> by the time ->freepage(page) is called.
>>
>> Ok, so.  In fact the quota functions we call at free time only need
>> the super block, not the inode per se.  If we put a superblock pointer
>> instead of an inode pointer in page private, and refcounted that, I
>> think that should remove the circular ref.  The only reason I didn't
>> do it before is that the superblock refcounting functions didn't seem
>> to be globally visible in an obvious way.
>>
>> Does that sound like a reasonable approach?
> 
> That does sound closer to a reaonable approach, but my guess is that it
> will suck you into a world of superblock mutexes and semaphores, which
> you cannot take at free_huge_page() time.
> 
> It might be necessary to hang your own tiny structure off the superblock,
> with one refcount for the superblock, and one for each hugepage attached,
> you freeing that structure when the count goes down to zero from either
> direction.
> 
> Whatever you do needs testing with lockdep and atomic sleep checks.
> 
> I do dislike tying these separate levels together in such an unusual way,
> but it is a difficult problem and I don't know of an easy answer.  Maybe
> we'll need to find a better answer for other reasons, it does come up
> from time to time e.g. recent race between evicting inode and nrpages
> going down to 0.
> 
> You might care to take a look at how tmpfs (mm/shmem.c) deals with
> the equivalent issue there (sbinfo->used_blocks).  But I expect you to
> conclude that hugetlbfs cannot afford the kind of approximations that
> tmpfs can afford.
> 
> Although I think tmpfs is more correct, to be associating the count
> with pagecache (so the count goes down as soon as a page is truncated
> or evicted from pagecache), your fewer and huger pages, and reservation
> conventions, may well demand that the count stays up until the page is
> actually freed back to hugepool.  And let's not pretend that what tmpfs
> does is wonderful: the strange shmem_recalc_inode() tries its best to
> notice when memory pressure has freed clean pages, but it never looks
> beyond the inode being accessed at the times it's called.  Not at all
> satisfactory, but not actually an issue in practice, since we stopped
> allocating pages for simple reads from sparse file.  I did want to
> convert tmpfs to use ->freepage(), but couldn't manage it without
> stable mapping - same problem as you have.
> 
> Hugh



* Re: Fix refcounting in hugetlbfs quota handling
  2011-08-15 20:25           ` Andrew Barry
@ 2011-08-16  3:47             ` David Gibson
  2011-08-16 17:45               ` [RFC PATCH] mm/hugepages: Fix race between hugetlbfs umount and quota update Andrew Barry
  2011-08-16 17:45               ` Fix refcounting in hugetlbfs quota handling Andrew Barry
From: David Gibson @ 2011-08-16  3:47 UTC (permalink / raw)
  To: Andrew Barry
  Cc: Hugh Dickins, Linus Torvalds, Avi Kivity, Marcelo Tosatti,
	Jan Kiszka, qemu-devel, agraf, kvm, Paul Mackerras, linux-kernel,
	KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Mel Gorman,
	Andrew Hastings, Adam Litke, Minchan Kim

On Mon, Aug 15, 2011 at 03:25:35PM -0500, Andrew Barry wrote:
> I've been doing something similar to this last proposal. I put a
> hugetlbfs_sb_info pointer into page_private, and dropped a reference counter and
> an active/inactive bit into the hugetlbfs_sb_info struct. At Umount time, the
> sbinfo is freed, only if the reference count is zero. Otherwise, the last
> put_quota frees the sbinfo structure. This fixed the race we were seeing between
> umount and a put_quota from an rdma transaction. I just gave it a cursory test
> on a 3.0 kernel; it has seen quite a lot more testing on a 2.6.32-derived
> kernel, with no more hits of the umount race.
> 
> Does this address the problems you were thinking about?

Ah, this looks much better than my patch.  And the fact that you've
seen your race demonstrates clearly that this isn't just a kvm
problem.  I hope we can push this upstream very soon - what can I do
to help?

> -Andrew Barry
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 87b6e04..2ed1cca 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -615,8 +615,12 @@ static void hugetlbfs_put_super(struct s
>  	struct hugetlbfs_sb_info *sbi = HUGETLBFS_SB(sb);
> 
>  	if (sbi) {
> +		sbi->active = HPAGE_INACTIVE;
>  		sb->s_fs_info = NULL;
> -		kfree(sbi);
> +
> +		/*Free only if used quota is zero. */
> +		if (sbi->used_blocks == 0)
> +			kfree(sbi);
>  	}
>  }
> 
> @@ -851,6 +855,8 @@ hugetlbfs_fill_super(struct super_block
>  	sbinfo->free_blocks = config.nr_blocks;
>  	sbinfo->max_inodes = config.nr_inodes;
>  	sbinfo->free_inodes = config.nr_inodes;
> +	sbinfo->used_blocks = 0;
> +	sbinfo->active = HPAGE_ACTIVE;
>  	sb->s_maxbytes = MAX_LFS_FILESIZE;
>  	sb->s_blocksize = huge_page_size(config.hstate);
>  	sb->s_blocksize_bits = huge_page_shift(config.hstate);
> @@ -874,30 +880,36 @@ out_free:
>  	return -ENOMEM;
>  }
> 
> -int hugetlb_get_quota(struct address_space *mapping, long delta)
> +int hugetlb_get_quota(struct hugetlbfs_sb_info *sbinfo, long delta)
>  {
>  	int ret = 0;
> -	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
> 
> -	if (sbinfo->free_blocks > -1) {
> -		spin_lock(&sbinfo->stat_lock);
> -		if (sbinfo->free_blocks - delta >= 0)
> +	spin_lock(&sbinfo->stat_lock);
> +	if ((sbinfo->free_blocks == -1) || (sbinfo->free_blocks - delta >= 0)) {
> +		if (sbinfo->free_blocks != -1)
>  			sbinfo->free_blocks -= delta;
> -		else
> -			ret = -ENOMEM;
> -		spin_unlock(&sbinfo->stat_lock);
> +		sbinfo->used_blocks += delta;
> +		sbinfo->active = HPAGE_ACTIVE;
> +	} else {
> +		ret = -ENOMEM;
>  	}
> +	spin_unlock(&sbinfo->stat_lock);
> 
>  	return ret;
>  }
> 
> -void hugetlb_put_quota(struct address_space *mapping, long delta)
> +void hugetlb_put_quota(struct hugetlbfs_sb_info *sbinfo, long delta)
>  {
> -	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
> -
> -	if (sbinfo->free_blocks > -1) {
> -		spin_lock(&sbinfo->stat_lock);
> +	spin_lock(&sbinfo->stat_lock);
> +	if (sbinfo->free_blocks > -1)
>  		sbinfo->free_blocks += delta;
> +	sbinfo->used_blocks -= delta;
> +	/* If hugetlbfs_put_super couldn't free sbinfo due to
> +	* an outstanding quota reference, free it now. */
> +	if ((sbinfo->used_blocks == 0) && (sbinfo->active == HPAGE_INACTIVE)) {
> +		spin_unlock(&sbinfo->stat_lock);
> +		kfree(sbinfo);
> +	} else {
>  		spin_unlock(&sbinfo->stat_lock);
>  	}
>  }
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 19644e0..8780a91 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -142,11 +142,16 @@ struct hugetlbfs_config {
>  	struct hstate *hstate;
>  };
> 
> +#define HPAGE_INACTIVE  0
> +#define HPAGE_ACTIVE    1
> +
>  struct hugetlbfs_sb_info {
>  	long	max_blocks;   /* blocks allowed */
>  	long	free_blocks;  /* blocks free */
>  	long	max_inodes;   /* inodes allowed */
>  	long	free_inodes;  /* inodes free */
> +	long	used_blocks;  /* blocks used */
> +	long	active;		  /* active bit */
>  	spinlock_t	stat_lock;
>  	struct hstate *hstate;
>  };
> @@ -171,8 +176,8 @@ extern const struct file_operations huge
>  extern const struct vm_operations_struct hugetlb_vm_ops;
>  struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
>  				struct user_struct **user, int creat_flags);
> -int hugetlb_get_quota(struct address_space *mapping, long delta);
> -void hugetlb_put_quota(struct address_space *mapping, long delta);
> +int hugetlb_get_quota(struct hugetlbfs_sb_info *sbinfo, long delta);
> +void hugetlb_put_quota(struct hugetlbfs_sb_info *sbinfo, long delta);
> 
>  static inline int is_file_hugepages(struct file *file)
>  {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dae27ba..cf26ae9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -533,9 +533,9 @@ static void free_huge_page(struct page *
>  	 */
>  	struct hstate *h = page_hstate(page);
>  	int nid = page_to_nid(page);
> -	struct address_space *mapping;
> +	struct hugetlbfs_sb_info *sbinfo;
> 
> -	mapping = (struct address_space *) page_private(page);
> +	sbinfo = ( struct hugetlbfs_sb_info *) page_private(page);
>  	set_page_private(page, 0);
>  	page->mapping = NULL;
>  	BUG_ON(page_count(page));
> @@ -551,8 +551,8 @@ static void free_huge_page(struct page *
>  		enqueue_huge_page(h, page);
>  	}
>  	spin_unlock(&hugetlb_lock);
> -	if (mapping)
> -		hugetlb_put_quota(mapping, 1);
> +	if (sbinfo)
> +		hugetlb_put_quota(sbinfo, 1);
>  }
> 
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> @@ -1035,7 +1035,7 @@ static struct page *alloc_huge_page(stru
>  	if (chg < 0)
>  		return ERR_PTR(-VM_FAULT_OOM);
>  	if (chg)
> -		if (hugetlb_get_quota(inode->i_mapping, chg))
> +		if (hugetlb_get_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg))
>  			return ERR_PTR(-VM_FAULT_SIGBUS);
> 
>  	spin_lock(&hugetlb_lock);
> @@ -1045,12 +1045,12 @@ static struct page *alloc_huge_page(stru
>  	if (!page) {
>  		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
>  		if (!page) {
> -			hugetlb_put_quota(inode->i_mapping, chg);
> +			hugetlb_put_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg);
>  			return ERR_PTR(-VM_FAULT_SIGBUS);
>  		}
>  	}
> 
> -	set_page_private(page, (unsigned long) mapping);
> +	set_page_private(page, (unsigned long) HUGETLBFS_SB(inode->i_mapping->host->i_sb));
> 
>  	vma_commit_reservation(h, vma, addr);
> 
> @@ -2086,7 +2086,7 @@ static void hugetlb_vm_op_close(struct v
> 
>  		if (reserve) {
>  			hugetlb_acct_memory(h, -reserve);
> -			hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
> +			hugetlb_put_quota(HUGETLBFS_SB(vma->vm_file->f_mapping->host->i_sb), reserve);
>  		}
>  	}
>  }
> @@ -2884,7 +2884,7 @@ int hugetlb_reserve_pages(struct inode *
>  		return chg;
> 
>  	/* There must be enough filesystem quota for the mapping */
> -	if (hugetlb_get_quota(inode->i_mapping, chg))
> +	if (hugetlb_get_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg))
>  		return -ENOSPC;
> 
>  	/*
> @@ -2893,7 +2893,7 @@ int hugetlb_reserve_pages(struct inode *
>  	 */
>  	ret = hugetlb_acct_memory(h, chg);
>  	if (ret < 0) {
> -		hugetlb_put_quota(inode->i_mapping, chg);
> +		hugetlb_put_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg);
>  		return ret;
>  	}
> 
> @@ -2922,7 +2922,7 @@ void hugetlb_unreserve_pages(struct inod
>  	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
>  	spin_unlock(&inode->i_lock);
> 
> -	hugetlb_put_quota(inode->i_mapping, (chg - freed));
> +	hugetlb_put_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), (chg - freed));
>  	hugetlb_acct_memory(h, -(chg - freed));
>  }
> 
> 
> 
> On 08/15/2011 01:00 PM, Hugh Dickins wrote:
> > On Sat, 13 Aug 2011, David Gibson wrote:
> >> On Fri, Aug 12, 2011 at 12:15:21PM -0700, Hugh Dickins wrote:
> >>>
> >>> Setting that aside, I think this thing of grabbing a reference to inode
> >>> for each page just does not work as you wish: when we unlink an inode,
> >>> all its pages should be freed; but because they are themselves holding
> >>> references to the inode, it and its pages stick around forever.
> >>
> >> Ugh, yes.  You're absolutely right.  That circular reference will mess
> >> everything up.  Thinking it through and testing fail.
> >>
> >>> A quick experiment with your patch versus without confirmed that:
> >>> meminfo HugePages_Free stayed down with your patch, but went back to
> >>> HugePages_Total without it.  Please check, perhaps I'm just mistaken.
> >>>
> >>> Sorry, I've not looked into what a constructive alternative might be;
> >>> and it's not the first time we've had this difficulty - it came up last
> >>> year when the ->freepage function was added, that the inode may be gone
> >>> by the time ->freepage(page) is called.
> >>
> >> Ok, so.  In fact the quota functions we call at free time only need
> >> the super block, not the inode per se.  If we put a superblock pointer
> >> instead of an inode pointer in page private, and refcounted that, I
> >> think that should remove the circular ref.  The only reason I didn't
> >> do it before is that the superblock refcounting functions didn't seem
> >> to be globally visible in an obvious way.
> >>
> >> Does that sound like a reasonable approach?
> > 
> > That does sound closer to a reaonable approach, but my guess is that it
> > will suck you into a world of superblock mutexes and semaphores, which
> > you cannot take at free_huge_page() time.
> > 
> > It might be necessary to hang your own tiny structure off the superblock,
> > with one refcount for the superblock, and one for each hugepage attached,
> > you freeing that structure when the count goes down to zero from either
> > direction.
> > 
> > Whatever you do needs testing with lockdep and atomic sleep checks.
> > 
> > I do dislike tying these separate levels together in such an unusual way,
> > but it is a difficult problem and I don't know of an easy answer.  Maybe
> > we'll need to find a better answer for other reasons, it does come up
> > from time to time e.g. recent race between evicting inode and nrpages
> > going down to 0.
> > 
> > You might care to take a look at how tmpfs (mm/shmem.c) deals with
> > the equivalent issue there (sbinfo->used_blocks).  But I expect you to
> > conclude that hugetlbfs cannot afford the kind of approximations that
> > tmpfs can afford.
> > 
> > Although I think tmpfs is more correct, to be associating the count
> > with pagecache (so the count goes down as soon as a page is truncated
> > or evicted from pagecache), your fewer and huger pages, and reservation
> > conventions, may well demand that the count stays up until the page is
> > actually freed back to hugepool.  And let's not pretend that what tmpfs
> > does is wonderful: the strange shmem_recalc_inode() tries its best to
> > notice when memory pressure has freed clean pages, but it never looks
> > beyond the inode being accessed at the times it's called.  Not at all
> > satisfactory, but not actually an issue in practice, since we stopped
> > allocating pages for simple reads from sparse file.  I did want to
> > convert tmpfs to use ->freepage(), but couldn't manage it without
> > stable mapping - same problem as you have.
> > 
> > Hugh
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* [RFC PATCH] mm/hugepages: Fix race between hugetlbfs umount and quota update.
  2011-08-16  3:47             ` David Gibson
@ 2011-08-16 17:45               ` Andrew Barry
  2011-08-16 17:45               ` Fix refcounting in hugetlbfs quota handling Andrew Barry
From: Andrew Barry @ 2011-08-16 17:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hugh Dickins, Avi Kivity, Marcelo Tosatti, Jan Kiszka,
	qemu-devel, agraf, kvm, Paul Mackerras, KOSAKI Motohiro,
	Andrew Morton, KAMEZAWA Hiroyuki, Mel Gorman, Andrew Hastings,
	Adam Litke, Minchan Kim

Discussion was titled:  Fix refcounting in hugetlbfs quota handling

This patch fixes a race between the umount of a hugetlbfs filesystem and quota
updates in that filesystem, which can result in the filesystem quota record
being updated after the record structure has been freed.

Rather than an address-space struct pointer, it puts a hugetlbfs_sb_info struct
pointer into page_private of the page struct. A reference count and an active
bit are added to the hugetlbfs_sb_info struct; the reference count is increased
by hugetlb_get_quota and decreased by hugetlb_put_quota. When hugetlbfs is
unmounted, it frees the hugetlbfs_sb_info struct, but only if the reference
count is zero; otherwise it clears the active bit, and the last hugetlb_put_quota
then frees the hugetlbfs_sb_info struct.

Signed-off-by: Andrew Barry <abarry@cray.com>

---

 fs/hugetlbfs/inode.c    |   40 ++++++++++++++++++++++++++--------------
 include/linux/hugetlb.h |    9 +++++++--
 mm/hugetlb.c            |   22 +++++++++++-----------
 3 files changed, 44 insertions(+), 27 deletions(-)


diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 87b6e04..2ed1cca 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -615,8 +615,12 @@ static void hugetlbfs_put_super(struct s
 	struct hugetlbfs_sb_info *sbi = HUGETLBFS_SB(sb);

 	if (sbi) {
+		sbi->active = HPAGE_INACTIVE;
 		sb->s_fs_info = NULL;
-		kfree(sbi);
+
+		/*Free only if used quota is zero. */
+		if (sbi->used_blocks == 0)
+			kfree(sbi);
 	}
 }

@@ -851,6 +855,8 @@ hugetlbfs_fill_super(struct super_block
 	sbinfo->free_blocks = config.nr_blocks;
 	sbinfo->max_inodes = config.nr_inodes;
 	sbinfo->free_inodes = config.nr_inodes;
+	sbinfo->used_blocks = 0;
+	sbinfo->active = HPAGE_ACTIVE;
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
 	sb->s_blocksize = huge_page_size(config.hstate);
 	sb->s_blocksize_bits = huge_page_shift(config.hstate);

@@ -874,30 +880,36 @@ out_free:
 	return -ENOMEM;
 }

-int hugetlb_get_quota(struct address_space *mapping, long delta)
+int hugetlb_get_quota(struct hugetlbfs_sb_info *sbinfo, long delta)
 {
 	int ret = 0;
-	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

-	if (sbinfo->free_blocks > -1) {
-		spin_lock(&sbinfo->stat_lock);
-		if (sbinfo->free_blocks - delta >= 0)
+	spin_lock(&sbinfo->stat_lock);
+	if ((sbinfo->free_blocks == -1) || (sbinfo->free_blocks - delta >= 0)) {
+		if (sbinfo->free_blocks != -1)
 			sbinfo->free_blocks -= delta;
-		else
-			ret = -ENOMEM;
-		spin_unlock(&sbinfo->stat_lock);
+		sbinfo->used_blocks += delta;
+		sbinfo->active = HPAGE_ACTIVE;
+	} else {
+		ret = -ENOMEM;
 	}
+	spin_unlock(&sbinfo->stat_lock);

 	return ret;
 }

-void hugetlb_put_quota(struct address_space *mapping, long delta)
+void hugetlb_put_quota(struct hugetlbfs_sb_info *sbinfo, long delta)
 {
-	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
-
-	if (sbinfo->free_blocks > -1) {
-		spin_lock(&sbinfo->stat_lock);
+	spin_lock(&sbinfo->stat_lock);
+	if (sbinfo->free_blocks > -1)
 		sbinfo->free_blocks += delta;
+	sbinfo->used_blocks -= delta;
+	/* If hugetlbfs_put_super couldn't free sbinfo due to
+	 * an outstanding quota reference, free it now. */
+	if ((sbinfo->used_blocks == 0) && (sbinfo->active == HPAGE_INACTIVE)) {
+		spin_unlock(&sbinfo->stat_lock);
+		kfree(sbinfo);
+	} else {
 		spin_unlock(&sbinfo->stat_lock);
 	}
 }


diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 19644e0..8780a91 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -142,11 +142,16 @@ struct hugetlbfs_config {
 	struct hstate *hstate;
 };

+#define HPAGE_INACTIVE  0
+#define HPAGE_ACTIVE    1
+
 struct hugetlbfs_sb_info {
 	long	max_blocks;   /* blocks allowed */
 	long	free_blocks;  /* blocks free */
 	long	max_inodes;   /* inodes allowed */
 	long	free_inodes;  /* inodes free */
+	long	used_blocks;  /* blocks used */
+	long	active;		  /* active bit */
 	spinlock_t	stat_lock;
 	struct hstate *hstate;
 };

@@ -171,8 +176,8 @@ extern const struct file_operations huge
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
 				struct user_struct **user, int creat_flags);
-int hugetlb_get_quota(struct address_space *mapping, long delta);
-void hugetlb_put_quota(struct address_space *mapping, long delta);
+int hugetlb_get_quota(struct hugetlbfs_sb_info *sbinfo, long delta);
+void hugetlb_put_quota(struct hugetlbfs_sb_info *sbinfo, long delta);

 static inline int is_file_hugepages(struct file *file)
 {


diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dae27ba..cf26ae9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -533,9 +533,9 @@ static void free_huge_page(struct page *
 	 */
 	struct hstate *h = page_hstate(page);
 	int nid = page_to_nid(page);
-	struct address_space *mapping;
+	struct hugetlbfs_sb_info *sbinfo;

-	mapping = (struct address_space *) page_private(page);
+	sbinfo = (struct hugetlbfs_sb_info *) page_private(page);
 	set_page_private(page, 0);
 	page->mapping = NULL;
 	BUG_ON(page_count(page));

@@ -551,8 +551,8 @@ static void free_huge_page(struct page *
 		enqueue_huge_page(h, page);
 	}
 	spin_unlock(&hugetlb_lock);
-	if (mapping)
-		hugetlb_put_quota(mapping, 1);
+	if (sbinfo)
+		hugetlb_put_quota(sbinfo, 1);
 }

 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)

@@ -1035,7 +1035,7 @@ static struct page *alloc_huge_page(stru
 	if (chg < 0)
 		return ERR_PTR(-VM_FAULT_OOM);
 	if (chg)
-		if (hugetlb_get_quota(inode->i_mapping, chg))
+		if (hugetlb_get_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg))
 			return ERR_PTR(-VM_FAULT_SIGBUS);

 	spin_lock(&hugetlb_lock);

@@ -1045,12 +1045,12 @@ static struct page *alloc_huge_page(stru
 	if (!page) {
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
-			hugetlb_put_quota(inode->i_mapping, chg);
+			hugetlb_put_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg);
 			return ERR_PTR(-VM_FAULT_SIGBUS);
 		}
 	}

-	set_page_private(page, (unsigned long) mapping);
+	set_page_private(page, (unsigned long) HUGETLBFS_SB(inode->i_mapping->host->i_sb));

 	vma_commit_reservation(h, vma, addr);

@@ -2086,7 +2086,7 @@ static void hugetlb_vm_op_close(struct v

 		if (reserve) {
 			hugetlb_acct_memory(h, -reserve);
-			hugetlb_put_quota(vma->vm_file->f_mapping, reserve);
+			hugetlb_put_quota(HUGETLBFS_SB(vma->vm_file->f_mapping->host->i_sb), reserve);
 		}
 	}
 }

@@ -2884,7 +2884,7 @@ int hugetlb_reserve_pages(struct inode *
 		return chg;

 	/* There must be enough filesystem quota for the mapping */
-	if (hugetlb_get_quota(inode->i_mapping, chg))
+	if (hugetlb_get_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg))
 		return -ENOSPC;

 	/*

@@ -2893,7 +2893,7 @@ int hugetlb_reserve_pages(struct inode *
 	 */
 	ret = hugetlb_acct_memory(h, chg);
 	if (ret < 0) {
-		hugetlb_put_quota(inode->i_mapping, chg);
+		hugetlb_put_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), chg);
 		return ret;
 	}

@@ -2922,7 +2922,7 @@ void hugetlb_unreserve_pages(struct inod
 	inode->i_blocks -= (blocks_per_huge_page(h) * freed);
 	spin_unlock(&inode->i_lock);

-	hugetlb_put_quota(inode->i_mapping, (chg - freed));
+	hugetlb_put_quota(HUGETLBFS_SB(inode->i_mapping->host->i_sb), (chg - freed));
 	hugetlb_acct_memory(h, -(chg - freed));
 }


On 08/15/2011 10:47 PM, David Gibson wrote:
> On Mon, Aug 15, 2011 at 03:25:35PM -0500, Andrew Barry wrote:
>> I've been doing something similar to this last proposal. I put a
>> hugetlbfs_sb_info pointer into page_private, and dropped a reference counter and
>> an active/inactive bit into the hugetlbfs_sb_info struct. At umount time, the
>> sbinfo is freed, only if the reference count is zero. Otherwise, the last
>> put_quota frees the sbinfo structure. This fixed the race we were seeing between
>> umount and a put_quota from an rdma transaction. I just gave it a cursory test
>> on a 3.0 kernel; it has seen quite a lot more testing on a 2.6.32-derived
>> kernel, with no more hits of the umount race.
>>
>> Does this address the problems you were thinking about?
> 
> Ah, this looks much better than my patch.  And the fact that you've
> seen your race demonstrates clearly that this isn't just a kvm
> problem.  I hope we can push this upstream very soon - what can I do
> to help?
> 
>> -Andrew Barry
>>




^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: Fix refcounting in hugetlbfs quota handling
  2011-08-16  3:47             ` David Gibson
  2011-08-16 17:45               ` [RFC PATCH] mm/hugepages: Fix race between hugetlbfs umount and quota update Andrew Barry
@ 2011-08-16 17:45               ` Andrew Barry
  1 sibling, 0 replies; 11+ messages in thread
From: Andrew Barry @ 2011-08-16 17:45 UTC (permalink / raw)
  To: Hugh Dickins, Avi Kivity, Marcelo Tosatti, Jan Kiszka,
	qemu-devel, agraf, kvm, Paul Mackerras, linux-kernel,
	KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Mel Gorman,
	Andrew Hastings, Adam Litke, Minchan Kim

It is not a unique KVM problem. We saw the race while doing large async RDMA in
our network driver, but I can imagine it happening with a slow NFS server or
other DMA that could complete after umount.

What I need, in order to push this upstream, is:
1. For you to light a fire under my feet to get this going. (done)
2. For you all to review the patch and let me know if I've missed anything.
Patch RFC coming ASAP.

-Andrew Barry


On 08/15/2011 10:47 PM, David Gibson wrote:
> On Mon, Aug 15, 2011 at 03:25:35PM -0500, Andrew Barry wrote:
>> I've been doing something similar to this last proposal. I put a
>> hugetlbfs_sb_info pointer into page_private, and dropped a reference counter and
>> an active/inactive bit into the hugetlbfs_sb_info struct. At umount time, the
>> sbinfo is freed, only if the reference count is zero. Otherwise, the last
>> put_quota frees the sbinfo structure. This fixed the race we were seeing between
>> umount and a put_quota from an rdma transaction. I just gave it a cursory test
>> on a 3.0 kernel; it has seen quite a lot more testing on a 2.6.32-derived
>> kernel, with no more hits of the umount race.
>>
>> Does this address the problems you were thinking about?
> 
> Ah, this looks much better than my patch.  And the fact that you've
> seen your race demonstrates clearly that this isn't just a kvm
> problem.  I hope we can push this upstream very soon - what can I do
> to help?
> 
>> -Andrew Barry
>>

>>
>> On 08/15/2011 01:00 PM, Hugh Dickins wrote:
>>> On Sat, 13 Aug 2011, David Gibson wrote:
>>>> On Fri, Aug 12, 2011 at 12:15:21PM -0700, Hugh Dickins wrote:
>>>>>
>>>>> Setting that aside, I think this thing of grabbing a reference to inode
>>>>> for each page just does not work as you wish: when we unlink an inode,
>>>>> all its pages should be freed; but because they are themselves holding
>>>>> references to the inode, it and its pages stick around forever.
>>>>
>>>> Ugh, yes.  You're absolutely right.  That circular reference will mess
>>>> everything up.  Thinking it through and testing fail.
>>>>
>>>>> A quick experiment with your patch versus without confirmed that:
>>>>> meminfo HugePages_Free stayed down with your patch, but went back to
>>>>> HugePages_Total without it.  Please check, perhaps I'm just mistaken.
>>>>>
>>>>> Sorry, I've not looked into what a constructive alternative might be;
>>>>> and it's not the first time we've had this difficulty - it came up last
>>>>> year when the ->freepage function was added, that the inode may be gone
>>>>> by the time ->freepage(page) is called.
>>>>
>>>> Ok, so.  In fact the quota functions we call at free time only need
>>>> the super block, not the inode per se.  If we put a superblock pointer
>>>> instead of an inode pointer in page private, and refcounted that, I
>>>> think that should remove the circular ref.  The only reason I didn't
>>>> do it before is that the superblock refcounting functions didn't seem
>>>> to be globally visible in an obvious way.
>>>>
>>>> Does that sound like a reasonable approach?
>>>
>>> That does sound closer to a reasonable approach, but my guess is that it
>>> will suck you into a world of superblock mutexes and semaphores, which
>>> you cannot take at free_huge_page() time.
>>>
>>> It might be necessary to hang your own tiny structure off the superblock,
>>> with one refcount for the superblock, and one for each hugepage attached,
>>> you freeing that structure when the count goes down to zero from either
>>> direction.
>>>
>>> Whatever you do needs testing with lockdep and atomic sleep checks.
>>>
>>> I do dislike tying these separate levels together in such an unusual way,
>>> but it is a difficult problem and I don't know of an easy answer.  Maybe
>>> we'll need to find a better answer for other reasons, it does come up
>>> from time to time e.g. recent race between evicting inode and nrpages
>>> going down to 0.
>>>
>>> You might care to take a look at how tmpfs (mm/shmem.c) deals with
>>> the equivalent issue there (sbinfo->used_blocks).  But I expect you to
>>> conclude that hugetlbfs cannot afford the kind of approximations that
>>> tmpfs can afford.
>>>
>>> Although I think tmpfs is more correct, to be associating the count
>>> with pagecache (so the count goes down as soon as a page is truncated
>>> or evicted from pagecache), your fewer and huger pages, and reservation
>>> conventions, may well demand that the count stays up until the page is
>>> actually freed back to hugepool.  And let's not pretend that what tmpfs
>>> does is wonderful: the strange shmem_recalc_inode() tries its best to
>>> notice when memory pressure has freed clean pages, but it never looks
>>> beyond the inode being accessed at the times it's called.  Not at all
>>> satisfactory, but not actually an issue in practice, since we stopped
>>> allocating pages for simple reads from sparse file.  I did want to
>>> convert tmpfs to use ->freepage(), but couldn't manage it without
>>> stable mapping - same problem as you have.
>>>
>>> Hugh
>>
> 
> --
> David Gibson                    | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
>                                 | _way_ _around_!
> http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2011-08-16 17:46 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-11  6:40 Fix refcounting in hugetlbfs quota handling David Gibson
2011-08-12  0:48 ` Linus Torvalds
2011-08-12  4:34   ` Minchan Kim
2011-08-12 19:15     ` Hugh Dickins
2011-08-13  1:08       ` David Gibson
2011-08-15 18:00         ` Hugh Dickins
2011-08-15 20:25           ` Andrew Barry
2011-08-16  3:47             ` David Gibson
2011-08-16 17:45               ` [RFC PATCH] mm/hugepages: Fix race between hugetlbfs umount and quota update Andrew Barry
2011-08-16 17:45               ` Fix refcounting in hugetlbfs quota handling Andrew Barry
2011-08-12 22:20 ` [Qemu-devel] " Christoph Hellwig
