[1/6] mm: khugepaged: fix radix tree node leak in shmem collapse error path

Message ID 20161107190741.3619-2-hannes@cmpxchg.org
State New, archived
Series
  • mm: workingset: radix tree subtleties & single-page file refaults

Commit Message

Johannes Weiner Nov. 7, 2016, 7:07 p.m. UTC
The radix tree counts valid entries in each tree node. Entries stored
in the tree cannot be removed by simply storing NULL in the slot or
the internal counters will be off and the node never gets freed again.

When collapsing a shmem page fails, restore the holes that were filled
with radix_tree_insert() with a proper radix tree deletion.

Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/khugepaged.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
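
The counting problem described in the commit message can be illustrated with a minimal userspace sketch. This is not kernel code: `toy_node`, `toy_tree`, and the `toy_*` helpers are hypothetical stand-ins, and the model assumes a single node with four slots. It only shows why clearing a slot without updating the per-node count keeps the node alive, while a proper deletion frees it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_SLOTS 4

/* Hypothetical stand-in for a tree node: slots plus a count of valid entries. */
struct toy_node {
	unsigned int count;
	void *slots[TOY_SLOTS];
};

struct toy_tree {
	struct toy_node *node;	/* NULL while the tree is empty */
};

/* Insert an item; allocates the node on first use and bumps its count. */
static int toy_insert(struct toy_tree *t, unsigned int index, void *item)
{
	if (index >= TOY_SLOTS || !item)
		return -1;
	if (!t->node) {
		t->node = calloc(1, sizeof(*t->node));
		if (!t->node)
			return -1;
	}
	if (t->node->slots[index])
		return -1;
	t->node->slots[index] = item;
	t->node->count++;
	return 0;
}

/* Buggy removal: clears the slot but leaves the count untouched. */
static void toy_store_null(struct toy_tree *t, unsigned int index)
{
	if (t->node && index < TOY_SLOTS)
		t->node->slots[index] = NULL;
}

/* Proper deletion: updates the count and frees the node once it is empty. */
static void *toy_delete(struct toy_tree *t, unsigned int index)
{
	void *item;

	if (!t->node || index >= TOY_SLOTS)
		return NULL;
	item = t->node->slots[index];
	if (!item)
		return NULL;
	t->node->slots[index] = NULL;
	if (--t->node->count == 0) {
		free(t->node);
		t->node = NULL;
	}
	return item;
}
```

In this model, a tree emptied through `toy_store_null()` still holds a node whose count is 1, whereas a tree emptied through `toy_delete()` has no node left.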

Comments

Jan Kara Nov. 8, 2016, 9:53 a.m. UTC | #1
On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> The radix tree counts valid entries in each tree node. Entries stored
> in the tree cannot be removed by simply storing NULL in the slot or
> the internal counters will be off and the node never gets freed again.
> 
> When collapsing a shmem page fails, restore the holes that were filled
> with radix_tree_insert() with a proper radix tree deletion.
> 
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Reported-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/khugepaged.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 728d7790dc2d..eac6f0580e26 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
>  				if (!nr_none)
>  					break;
>  				/* Put holes back where they were */
> -				radix_tree_replace_slot(slot, NULL);
> +				radix_tree_delete(&mapping->page_tree,
> +						  iter.index);

Hum, but this is inside radix_tree_for_each_slot() iteration. And
radix_tree_delete() may end up freeing nodes resulting in invalidating
current slot pointer and the iteration code will do use-after-free.

								Honza
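
The hazard raised here can be sketched in userspace C. The model below is an assumption-laden toy, not the kernel radix tree: `toy_leaf`, `toy_tree`, and the `toy_*` functions are hypothetical, and the tree is just two separately allocated leaves. Deleting the last entry of a leaf frees that leaf, so any slot pointer obtained before the deletion would dangle. The loop therefore never carries a slot pointer across a deletion and derives it from the index on every pass.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_FANOUT 4
#define TOY_LEAVES 2
#define TOY_MAX (TOY_FANOUT * TOY_LEAVES)

struct toy_leaf {
	unsigned int count;		/* number of non-NULL slots */
	void *slots[TOY_FANOUT];
};

struct toy_tree {
	struct toy_leaf *leaves[TOY_LEAVES];
};

/* Return the address of the slot for index, or NULL if it is empty. */
static void **toy_lookup_slot(struct toy_tree *t, unsigned long index)
{
	struct toy_leaf *leaf;

	if (index >= TOY_MAX)
		return NULL;
	leaf = t->leaves[index / TOY_FANOUT];
	if (!leaf || !leaf->slots[index % TOY_FANOUT])
		return NULL;
	return &leaf->slots[index % TOY_FANOUT];
}

static int toy_insert(struct toy_tree *t, unsigned long index, void *item)
{
	struct toy_leaf **leafp;

	if (index >= TOY_MAX || !item)
		return -1;
	leafp = &t->leaves[index / TOY_FANOUT];
	if (!*leafp) {
		*leafp = calloc(1, sizeof(**leafp));
		if (!*leafp)
			return -1;
	}
	if ((*leafp)->slots[index % TOY_FANOUT])
		return -1;
	(*leafp)->slots[index % TOY_FANOUT] = item;
	(*leafp)->count++;
	return 0;
}

/* Delete an entry; frees the leaf when its last entry goes away. */
static void toy_delete(struct toy_tree *t, unsigned long index)
{
	void **slot = toy_lookup_slot(t, index);
	struct toy_leaf **leafp;

	if (!slot)
		return;
	leafp = &t->leaves[index / TOY_FANOUT];
	*slot = NULL;
	if (--(*leafp)->count == 0) {
		free(*leafp);	/* every pointer into this leaf is now invalid */
		*leafp = NULL;
	}
}

/* Remove all entries equal to hole, using a fresh lookup on every pass. */
static unsigned int toy_remove_holes(struct toy_tree *t, void *hole)
{
	unsigned long index;
	unsigned int removed = 0;

	for (index = 0; index < TOY_MAX; index++) {
		void **slot = toy_lookup_slot(t, index);

		if (!slot || *slot != hole)
			continue;
		toy_delete(t, index);	/* slot must not be used after this */
		removed++;
	}
	return removed;
}
```

A per-index lookup is the simplest safe pattern in this toy; it trades some speed for never touching memory that a deletion may have freed.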
Johannes Weiner Nov. 8, 2016, 4:12 p.m. UTC | #2
On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > The radix tree counts valid entries in each tree node. Entries stored
> > in the tree cannot be removed by simply storing NULL in the slot or
> > the internal counters will be off and the node never gets freed again.
> > 
> > When collapsing a shmem page fails, restore the holes that were filled
> > with radix_tree_insert() with a proper radix tree deletion.
> > 
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Reported-by: Jan Kara <jack@suse.cz>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/khugepaged.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 728d7790dc2d..eac6f0580e26 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> >  				if (!nr_none)
> >  					break;
> >  				/* Put holes back where they were */
> > -				radix_tree_replace_slot(slot, NULL);
> > +				radix_tree_delete(&mapping->page_tree,
> > +						  iter.index);
> 
> Hum, but this is inside radix_tree_for_each_slot() iteration. And
> radix_tree_delete() may end up freeing nodes resulting in invalidating
> current slot pointer and the iteration code will do use-after-free.

Good point, we need to do another tree lookup after the deletion.

But there are other instances in the code, where we drop the lock
temporarily and somebody else could delete the node from under us.

In the main collapse path, I *think* this is prevented by the fact
that when we drop the tree lock we still hold the page lock of the
regular page that's in the tree while we isolate and unmap it, thus
pin the node. Even so, it would seem a little hairy to rely on that.

Kirill?

I'll update this patch and prepend another fix to the series that
addresses the other two lock dropping issues.

Thanks Jan.

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fed8d5e96978..1e43e77a98da 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1424,6 +1424,7 @@ static void collapse_shmem(struct mm_struct *mm,
 		radix_tree_replace_slot(&mapping->page_tree, slot,
 				new_page + (index % HPAGE_PMD_NR));
 
+		slot = radix_tree_iter_next(&iter);
 		index++;
 		continue;
 out_lru:
@@ -1522,6 +1523,7 @@ static void collapse_shmem(struct mm_struct *mm,
 				/* Put holes back where they were */
 				radix_tree_delete(&mapping->page_tree,
 						  iter.index);
+				slot = radix_tree_iter_next(&iter);
 				nr_none--;
 				continue;
 			}
@@ -1537,6 +1539,7 @@ static void collapse_shmem(struct mm_struct *mm,
 			putback_lru_page(page);
 			unlock_page(page);
 			spin_lock_irq(&mapping->tree_lock);
+			slot = radix_tree_iter_next(&iter);
 		}
 		VM_BUG_ON(nr_none);
 		spin_unlock_irq(&mapping->tree_lock);
Jan Kara Nov. 9, 2016, 7:41 a.m. UTC | #3
On Tue 08-11-16 11:12:45, Johannes Weiner wrote:
> On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > The radix tree counts valid entries in each tree node. Entries stored
> > > in the tree cannot be removed by simply storing NULL in the slot or
> > > the internal counters will be off and the node never gets freed again.
> > > 
> > > When collapsing a shmem page fails, restore the holes that were filled
> > > with radix_tree_insert() with a proper radix tree deletion.
> > > 
> > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > Reported-by: Jan Kara <jack@suse.cz>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  mm/khugepaged.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 728d7790dc2d..eac6f0580e26 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > >  				if (!nr_none)
> > >  					break;
> > >  				/* Put holes back where they were */
> > > -				radix_tree_replace_slot(slot, NULL);
> > > +				radix_tree_delete(&mapping->page_tree,
> > > +						  iter.index);
> > 
> > Hum, but this is inside radix_tree_for_each_slot() iteration. And
> > radix_tree_delete() may end up freeing nodes resulting in invalidating
> > current slot pointer and the iteration code will do use-after-free.
> 
> Good point, we need to do another tree lookup after the deletion.
> 
> But there are other instances in the code, where we drop the lock
> temporarily and somebody else could delete the node from under us.
> 
> In the main collapse path, I *think* this is prevented by the fact
> that when we drop the tree lock we still hold the page lock of the
> regular page that's in the tree while we isolate and unmap it, thus
> pin the node. Even so, it would seem a little hairy to rely on that.

Yeah, I think that is mostly right, but I'm not sure whether shrinking of the
radix tree into a direct pointer cannot bite us here as well. Generally that
relies on the internal implementation of the radix tree and its iterator,
so what you did makes sense to me.

								Honza
Kirill A. Shutemov Nov. 11, 2016, 10:59 a.m. UTC | #4
On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > The radix tree counts valid entries in each tree node. Entries stored
> > > in the tree cannot be removed by simply storing NULL in the slot or
> > > the internal counters will be off and the node never gets freed again.
> > > 
> > > When collapsing a shmem page fails, restore the holes that were filled
> > > with radix_tree_insert() with a proper radix tree deletion.
> > > 
> > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > Reported-by: Jan Kara <jack@suse.cz>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  mm/khugepaged.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 728d7790dc2d..eac6f0580e26 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > >  				if (!nr_none)
> > >  					break;
> > >  				/* Put holes back where they were */
> > > -				radix_tree_replace_slot(slot, NULL);
> > > +				radix_tree_delete(&mapping->page_tree,
> > > +						  iter.index);
> > 
> > Hum, but this is inside radix_tree_for_each_slot() iteration. And
> > radix_tree_delete() may end up freeing nodes resulting in invalidating
> > current slot pointer and the iteration code will do use-after-free.
> 
> Good point, we need to do another tree lookup after the deletion.
> 
> But there are other instances in the code, where we drop the lock
> temporarily and somebody else could delete the node from under us.
> 
> In the main collapse path, I *think* this is prevented by the fact
> that when we drop the tree lock we still hold the page lock of the
> regular page that's in the tree while we isolate and unmap it, thus
> pin the node. Even so, it would seem a little hairy to rely on that.
> 
> Kirill?

[ sorry for delay ]

Yes, we make sure that the locked page still belongs to the radix tree and
bail out if it does not. A locked page cannot be removed from the radix tree,
so we should be fine.

> I'll update this patch and prepend another fix to the series that
> addresses the other two lock dropping issues.

Feel free to add my Acked-by.
Jan Kara Nov. 11, 2016, 12:22 p.m. UTC | #5
On Fri 11-11-16 13:59:21, Kirill A. Shutemov wrote:
> On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> > On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > > The radix tree counts valid entries in each tree node. Entries stored
> > > > in the tree cannot be removed by simply storing NULL in the slot or
> > > > the internal counters will be off and the node never gets freed again.
> > > > 
> > > > When collapsing a shmem page fails, restore the holes that were filled
> > > > with radix_tree_insert() with a proper radix tree deletion.
> > > > 
> > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > > Reported-by: Jan Kara <jack@suse.cz>
> > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > ---
> > > >  mm/khugepaged.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 728d7790dc2d..eac6f0580e26 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > >  				if (!nr_none)
> > > >  					break;
> > > >  				/* Put holes back where they were */
> > > > -				radix_tree_replace_slot(slot, NULL);
> > > > +				radix_tree_delete(&mapping->page_tree,
> > > > +						  iter.index);
> > > 
> > > Hum, but this is inside radix_tree_for_each_slot() iteration. And
> > > radix_tree_delete() may end up freeing nodes resulting in invalidating
> > > current slot pointer and the iteration code will do use-after-free.
> > 
> > Good point, we need to do another tree lookup after the deletion.
> > 
> > But there are other instances in the code, where we drop the lock
> > temporarily and somebody else could delete the node from under us.
> > 
> > In the main collapse path, I *think* this is prevented by the fact
> > that when we drop the tree lock we still hold the page lock of the
> > regular page that's in the tree while we isolate and unmap it, thus
> > pin the node. Even so, it would seem a little hairy to rely on that.
> > 
> > Kirill?
> 
> [ sorry for delay ]
> 
> Yes, we make sure that the locked page still belongs to the radix tree and
> bail out if it does not. A locked page cannot be removed from the radix tree,
> so we should be fine.

Well, it cannot be removed from the radix tree, but the radix tree code is
still free to collapse / expand the tree nodes as it sees fit (currently the
only real case is when changing a direct page pointer in the tree root to a
node pointer or vice versa, but still...). So code should not really assume
that the node a page is referenced from does not change once tree_lock is
dropped. It leads to subtle bugs...

								Honza
Kirill A. Shutemov Nov. 11, 2016, 4:37 p.m. UTC | #6
On Fri, Nov 11, 2016 at 01:22:24PM +0100, Jan Kara wrote:
> On Fri 11-11-16 13:59:21, Kirill A. Shutemov wrote:
> > On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> > > On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > > > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > > > The radix tree counts valid entries in each tree node. Entries stored
> > > > > in the tree cannot be removed by simply storing NULL in the slot or
> > > > > the internal counters will be off and the node never gets freed again.
> > > > > 
> > > > > When collapsing a shmem page fails, restore the holes that were filled
> > > > > with radix_tree_insert() with a proper radix tree deletion.
> > > > > 
> > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > > > Reported-by: Jan Kara <jack@suse.cz>
> > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > ---
> > > > >  mm/khugepaged.c | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index 728d7790dc2d..eac6f0580e26 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > > >  				if (!nr_none)
> > > > >  					break;
> > > > >  				/* Put holes back where they were */
> > > > > -				radix_tree_replace_slot(slot, NULL);
> > > > > +				radix_tree_delete(&mapping->page_tree,
> > > > > +						  iter.index);
> > > > 
> > > > Hum, but this is inside radix_tree_for_each_slot() iteration. And
> > > > radix_tree_delete() may end up freeing nodes resulting in invalidating
> > > > current slot pointer and the iteration code will do use-after-free.
> > > 
> > > Good point, we need to do another tree lookup after the deletion.
> > > 
> > > But there are other instances in the code, where we drop the lock
> > > temporarily and somebody else could delete the node from under us.
> > > 
> > > In the main collapse path, I *think* this is prevented by the fact
> > > that when we drop the tree lock we still hold the page lock of the
> > > regular page that's in the tree while we isolate and unmap it, thus
> > > pin the node. Even so, it would seem a little hairy to rely on that.
> > > 
> > > Kirill?
> > 
> > [ sorry for delay ]
> > 
> > Yes, we make sure that the locked page still belongs to the radix tree and
> > bail out if it does not. A locked page cannot be removed from the radix tree,
> > so we should be fine.
> 
> Well, it cannot be removed from the radix tree, but the radix tree code is
> still free to collapse / expand the tree nodes as it sees fit (currently the
> only real case is when changing a direct page pointer in the tree root to a
> node pointer or vice versa, but still...). So code should not really assume
> that the node a page is referenced from does not change once tree_lock is
> dropped. It leads to subtle bugs...

Hm. Okay.

What is the right way to re-validate that the slot is still valid? Do I need
a full lookup again? Can I pin the node explicitly?
Jan Kara Nov. 14, 2016, 8:07 a.m. UTC | #7
On Fri 11-11-16 19:37:53, Kirill A. Shutemov wrote:
> On Fri, Nov 11, 2016 at 01:22:24PM +0100, Jan Kara wrote:
> > On Fri 11-11-16 13:59:21, Kirill A. Shutemov wrote:
> > > On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> > > > On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > > > > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > > > > The radix tree counts valid entries in each tree node. Entries stored
> > > > > > in the tree cannot be removed by simply storing NULL in the slot or
> > > > > > the internal counters will be off and the node never gets freed again.
> > > > > > 
> > > > > > When collapsing a shmem page fails, restore the holes that were filled
> > > > > > with radix_tree_insert() with a proper radix tree deletion.
> > > > > > 
> > > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > > > > Reported-by: Jan Kara <jack@suse.cz>
> > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > ---
> > > > > >  mm/khugepaged.c | 3 ++-
> > > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > index 728d7790dc2d..eac6f0580e26 100644
> > > > > > --- a/mm/khugepaged.c
> > > > > > +++ b/mm/khugepaged.c
> > > > > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > > > >  				if (!nr_none)
> > > > > >  					break;
> > > > > >  				/* Put holes back where they were */
> > > > > > -				radix_tree_replace_slot(slot, NULL);
> > > > > > +				radix_tree_delete(&mapping->page_tree,
> > > > > > +						  iter.index);
> > > > > 
> > > > > Hum, but this is inside radix_tree_for_each_slot() iteration. And
> > > > > radix_tree_delete() may end up freeing nodes resulting in invalidating
> > > > > current slot pointer and the iteration code will do use-after-free.
> > > > 
> > > > Good point, we need to do another tree lookup after the deletion.
> > > > 
> > > > But there are other instances in the code, where we drop the lock
> > > > temporarily and somebody else could delete the node from under us.
> > > > 
> > > > In the main collapse path, I *think* this is prevented by the fact
> > > > that when we drop the tree lock we still hold the page lock of the
> > > > regular page that's in the tree while we isolate and unmap it, thus
> > > > pin the node. Even so, it would seem a little hairy to rely on that.
> > > > 
> > > > Kirill?
> > > 
> > > [ sorry for delay ]
> > > 
> > > Yes, we make sure that the locked page still belongs to the radix tree and
> > > bail out if it does not. A locked page cannot be removed from the radix tree,
> > > so we should be fine.
> > 
> > Well, it cannot be removed from the radix tree, but the radix tree code is
> > still free to collapse / expand the tree nodes as it sees fit (currently the
> > only real case is when changing a direct page pointer in the tree root to a
> > node pointer or vice versa, but still...). So code should not really assume
> > that the node a page is referenced from does not change once tree_lock is
> > dropped. It leads to subtle bugs...
> 
> Hm. Okay.
> 
> What is the right way to re-validate that the slot is still valid? Do I need
> a full lookup again? Can I pin the node explicitly?

Full lookup is the only way to re-validate the slot. There is no way to pin
a radix tree node.

									Honza
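
The root-versus-node case mentioned above can be modeled in a few lines of userspace C. Everything here is a hypothetical toy (`toy_root`, `toy_node`, `toy_lookup_slot()`, `toy_insert()`, `toy_revalidate_slot()`), assuming a tree of at most two entries whose root either holds the single entry for index 0 directly or points to a node. It is a sketch of why a slot pointer taken before the lock is dropped cannot be trusted afterwards, and why a full lookup plus a content check is the way to re-validate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_FANOUT 2

struct toy_node {
	void *slots[TOY_FANOUT];
};

struct toy_root {
	int is_node;	/* 0: rnode is the entry for index 0; 1: rnode is a toy_node */
	void *rnode;
};

/* Full lookup: returns the current address of the slot, or NULL if empty. */
static void **toy_lookup_slot(struct toy_root *root, unsigned long index)
{
	struct toy_node *node;

	if (!root->rnode || index >= TOY_FANOUT)
		return NULL;
	if (!root->is_node)
		return index == 0 ? &root->rnode : NULL;
	node = root->rnode;
	return node->slots[index] ? &node->slots[index] : NULL;
}

/* Insert; expands the root from a direct entry to a node when needed. */
static int toy_insert(struct toy_root *root, unsigned long index, void *item)
{
	struct toy_node *node;

	if (index >= TOY_FANOUT || !item || toy_lookup_slot(root, index))
		return -1;
	if (!root->rnode && index == 0) {
		root->rnode = item;	/* single entry lives directly in the root */
		root->is_node = 0;
		return 0;
	}
	if (!root->is_node) {
		node = calloc(1, sizeof(*node));
		if (!node)
			return -1;
		node->slots[0] = root->rnode;	/* the entry moves to a new slot */
		root->rnode = node;
		root->is_node = 1;
	}
	node = root->rnode;
	node->slots[index] = item;
	return 0;
}

/* Re-validate after the lock was dropped: look up again, check the content. */
static void **toy_revalidate_slot(struct toy_root *root, unsigned long index,
				  void *expected)
{
	void **slot = toy_lookup_slot(root, index);

	if (!slot || *slot != expected)
		return NULL;
	return slot;
}
```

In this toy, inserting index 1 while a caller still holds the old slot pointer for index 0 relocates the entry: the old pointer then refers to the root field holding the node, while `toy_revalidate_slot()` returns the new slot with the same content.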
Kirill A. Shutemov Nov. 14, 2016, 2:29 p.m. UTC | #8
On Mon, Nov 14, 2016 at 09:07:44AM +0100, Jan Kara wrote:
> On Fri 11-11-16 19:37:53, Kirill A. Shutemov wrote:
> > On Fri, Nov 11, 2016 at 01:22:24PM +0100, Jan Kara wrote:
> > > On Fri 11-11-16 13:59:21, Kirill A. Shutemov wrote:
> > > > On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> > > > > On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > > > > > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > > > > > The radix tree counts valid entries in each tree node. Entries stored
> > > > > > > in the tree cannot be removed by simply storing NULL in the slot or
> > > > > > > the internal counters will be off and the node never gets freed again.
> > > > > > > 
> > > > > > > When collapsing a shmem page fails, restore the holes that were filled
> > > > > > > with radix_tree_insert() with a proper radix tree deletion.
> > > > > > > 
> > > > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > > > > > Reported-by: Jan Kara <jack@suse.cz>
> > > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > ---
> > > > > > >  mm/khugepaged.c | 3 ++-
> > > > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > index 728d7790dc2d..eac6f0580e26 100644
> > > > > > > --- a/mm/khugepaged.c
> > > > > > > +++ b/mm/khugepaged.c
> > > > > > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > > > > >  				if (!nr_none)
> > > > > > >  					break;
> > > > > > >  				/* Put holes back where they were */
> > > > > > > -				radix_tree_replace_slot(slot, NULL);
> > > > > > > +				radix_tree_delete(&mapping->page_tree,
> > > > > > > +						  iter.index);
> > > > > > 
> > > > > > Hum, but this is inside radix_tree_for_each_slot() iteration. And
> > > > > > radix_tree_delete() may end up freeing nodes resulting in invalidating
> > > > > > current slot pointer and the iteration code will do use-after-free.
> > > > > 
> > > > > Good point, we need to do another tree lookup after the deletion.
> > > > > 
> > > > > But there are other instances in the code, where we drop the lock
> > > > > temporarily and somebody else could delete the node from under us.
> > > > > 
> > > > > In the main collapse path, I *think* this is prevented by the fact
> > > > > that when we drop the tree lock we still hold the page lock of the
> > > > > regular page that's in the tree while we isolate and unmap it, thus
> > > > > pin the node. Even so, it would seem a little hairy to rely on that.
> > > > > 
> > > > > Kirill?
> > > > 
> > > > [ sorry for delay ]
> > > > 
> > > > Yes, we make sure that the locked page still belongs to the radix tree and
> > > > bail out if it does not. A locked page cannot be removed from the radix tree,
> > > > so we should be fine.
> > > 
> > > Well, it cannot be removed from the radix tree, but the radix tree code is
> > > still free to collapse / expand the tree nodes as it sees fit (currently the
> > > only real case is when changing a direct page pointer in the tree root to a
> > > node pointer or vice versa, but still...). So code should not really assume
> > > that the node a page is referenced from does not change once tree_lock is
> > > dropped. It leads to subtle bugs...
> > 
> > Hm. Okay.
> > 
> > What is the right way to re-validate that the slot is still valid? Do I need
> > a full lookup again? Can I pin the node explicitly?
> 
> Full lookup is the only way to re-validate the slot. There is no way to pin
> a radix tree node.

I guess this should be enough:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 728d7790dc2d..c5ef73588676 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1400,7 +1400,9 @@ static void collapse_shmem(struct mm_struct *mm,
 					PAGE_SIZE, 0);
 
 		spin_lock_irq(&mapping->tree_lock);
-
+		slot = radix_tree_lookup_slot(&mapping->page_tree, index);
+		VM_BUG_ON_PAGE(page != radix_tree_deref_slot_protected(slot,
+					&mapping->tree_lock), page);
 		VM_BUG_ON_PAGE(page_mapped(page), page);
 
 		/*
Johannes Weiner Nov. 14, 2016, 3:52 p.m. UTC | #9
On Mon, Nov 14, 2016 at 05:29:02PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 14, 2016 at 09:07:44AM +0100, Jan Kara wrote:
> > On Fri 11-11-16 19:37:53, Kirill A. Shutemov wrote:
> > > On Fri, Nov 11, 2016 at 01:22:24PM +0100, Jan Kara wrote:
> > > > On Fri 11-11-16 13:59:21, Kirill A. Shutemov wrote:
> > > > > On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> > > > > > On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > > > > > > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > > > > > > The radix tree counts valid entries in each tree node. Entries stored
> > > > > > > > in the tree cannot be removed by simply storing NULL in the slot or
> > > > > > > > the internal counters will be off and the node never gets freed again.
> > > > > > > > 
> > > > > > > > When collapsing a shmem page fails, restore the holes that were filled
> > > > > > > > with radix_tree_insert() with a proper radix tree deletion.
> > > > > > > > 
> > > > > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > > > > > > Reported-by: Jan Kara <jack@suse.cz>
> > > > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > > ---
> > > > > > > >  mm/khugepaged.c | 3 ++-
> > > > > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > > > 
> > > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > > index 728d7790dc2d..eac6f0580e26 100644
> > > > > > > > --- a/mm/khugepaged.c
> > > > > > > > +++ b/mm/khugepaged.c
> > > > > > > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > > > > > >  				if (!nr_none)
> > > > > > > >  					break;
> > > > > > > >  				/* Put holes back where they were */
> > > > > > > > -				radix_tree_replace_slot(slot, NULL);
> > > > > > > > +				radix_tree_delete(&mapping->page_tree,
> > > > > > > > +						  iter.index);
> > > > > > > 
> > > > > > > Hum, but this is inside radix_tree_for_each_slot() iteration. And
> > > > > > > radix_tree_delete() may end up freeing nodes resulting in invalidating
> > > > > > > current slot pointer and the iteration code will do use-after-free.
> > > > > > 
> > > > > > Good point, we need to do another tree lookup after the deletion.
> > > > > > 
> > > > > > But there are other instances in the code, where we drop the lock
> > > > > > temporarily and somebody else could delete the node from under us.
> > > > > > 
> > > > > > In the main collapse path, I *think* this is prevented by the fact
> > > > > > that when we drop the tree lock we still hold the page lock of the
> > > > > > regular page that's in the tree while we isolate and unmap it, thus
> > > > > > pin the node. Even so, it would seem a little hairy to rely on that.
> > > > > > 
> > > > > > Kirill?
> > > > > 
> > > > > [ sorry for delay ]
> > > > > 
> > > > > Yes, we make sure that the locked page still belongs to the radix tree and
> > > > > bail out if it does not. A locked page cannot be removed from the radix tree,
> > > > > so we should be fine.
> > > > 
> > > > Well, it cannot be removed from the radix tree, but the radix tree code is
> > > > still free to collapse / expand the tree nodes as it sees fit (currently the
> > > > only real case is when changing a direct page pointer in the tree root to a
> > > > node pointer or vice versa, but still...). So code should not really assume
> > > > that the node a page is referenced from does not change once tree_lock is
> > > > dropped. It leads to subtle bugs...
> > > 
> > > Hm. Okay.
> > > 
> > > What is the right way to re-validate that the slot is still valid? Do I need
> > > a full lookup again? Can I pin the node explicitly?
> > 
> > Full lookup is the only way to re-validate the slot. There is no way to pin
> > a radix tree node.
> 
> I guess this should be enough:
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 728d7790dc2d..c5ef73588676 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1400,7 +1400,9 @@ static void collapse_shmem(struct mm_struct *mm,
>  					PAGE_SIZE, 0);
>  
>  		spin_lock_irq(&mapping->tree_lock);
> -
> +		slot = radix_tree_lookup_slot(&mapping->page_tree, index);
> +		VM_BUG_ON_PAGE(page != radix_tree_deref_slot_protected(slot,
> +					&mapping->tree_lock), page);
>  		VM_BUG_ON_PAGE(page_mapped(page), page);

That looks good to me. The slot may get relocated, but the content
shouldn't change with the page locked.

Are you going to send a full patch with changelog and sign-off? If so,
please add:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Johannes Weiner Nov. 14, 2016, 4:48 p.m. UTC | #10
On Mon, Nov 14, 2016 at 10:52:50AM -0500, Johannes Weiner wrote:
> On Mon, Nov 14, 2016 at 05:29:02PM +0300, Kirill A. Shutemov wrote:
> > @@ -1400,7 +1400,9 @@ static void collapse_shmem(struct mm_struct *mm,
> >  					PAGE_SIZE, 0);
> >  
> >  		spin_lock_irq(&mapping->tree_lock);
> > -
> > +		slot = radix_tree_lookup_slot(&mapping->page_tree, index);
> > +		VM_BUG_ON_PAGE(page != radix_tree_deref_slot_protected(slot,
> > +					&mapping->tree_lock), page);
> >  		VM_BUG_ON_PAGE(page_mapped(page), page);
> 
> That looks good to me. The slot may get relocated, but the content
> shouldn't change with the page locked.
> 
> Are you going to send a full patch with changelog and sign-off? If so,
> please add:
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Just to clarify, this is in addition to my radix_tree_iter_next()
change. The iterator still needs to be reloaded because the number of
valid slots that come after the current one can change as well.
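
Why the iterator itself has to be reloaded can be sketched with a toy chunked iterator in userspace C. The names (`toy_tree`, `toy_iter`, `toy_next_chunk()`, `toy_next_slot()`, `toy_iter_reset()`) are hypothetical, and the design is only loosely modeled on the way the fixup in this thread uses radix_tree_iter_next(); it is not the kernel implementation. The iterator caches the end of the current chunk so it can advance by pointer arithmetic, and the reset helper discards that cache so the next step performs a fresh lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_CHUNK 4
#define TOY_CHUNKS 2
#define TOY_MAX (TOY_CHUNK * TOY_CHUNKS)

struct toy_tree {
	void **chunks[TOY_CHUNKS];	/* heap arrays of TOY_CHUNK slots, or NULL */
};

struct toy_iter {
	unsigned long index;		/* index of the current slot */
	unsigned long next_index;	/* cached end of the current chunk */
};

static int toy_insert(struct toy_tree *t, unsigned long index, void *item)
{
	void ***chunkp;

	if (index >= TOY_MAX || !item)
		return -1;
	chunkp = &t->chunks[index / TOY_CHUNK];
	if (!*chunkp && !(*chunkp = calloc(TOY_CHUNK, sizeof(void *))))
		return -1;
	(*chunkp)[index % TOY_CHUNK] = item;
	return 0;
}

/* Delete an entry; frees the whole chunk once it holds nothing else. */
static void toy_delete(struct toy_tree *t, unsigned long index)
{
	void **chunk;
	unsigned int i;

	if (index >= TOY_MAX || !(chunk = t->chunks[index / TOY_CHUNK]))
		return;
	chunk[index % TOY_CHUNK] = NULL;
	for (i = 0; i < TOY_CHUNK; i++)
		if (chunk[i])
			return;
	free(chunk);
	t->chunks[index / TOY_CHUNK] = NULL;
}

/* Full lookup: first occupied slot at or after iter->next_index. */
static void **toy_next_chunk(struct toy_tree *t, struct toy_iter *iter)
{
	unsigned long i;

	for (i = iter->next_index; i < TOY_MAX; i++) {
		void **chunk = t->chunks[i / TOY_CHUNK];

		if (chunk && chunk[i % TOY_CHUNK]) {
			iter->index = i;
			iter->next_index = (i / TOY_CHUNK + 1) * TOY_CHUNK;
			return &chunk[i % TOY_CHUNK];
		}
	}
	return NULL;
}

/* Fast path: step through the cached chunk without a new lookup. */
static void **toy_next_slot(void **slot, struct toy_iter *iter)
{
	while (++iter->index < iter->next_index) {
		slot++;
		if (*slot)
			return slot;
	}
	return NULL;
}

/* Forget the cached chunk so the next step does a full lookup. */
static void **toy_iter_reset(struct toy_iter *iter)
{
	iter->next_index = iter->index + 1;
	return NULL;
}

/* Delete every entry while iterating; returns the number of entries seen. */
static unsigned int toy_delete_all(struct toy_tree *t)
{
	struct toy_iter iter = { 0, 0 };
	unsigned int visited = 0;
	void **slot;

	for (slot = NULL; slot || (slot = toy_next_chunk(t, &iter));
	     slot = toy_next_slot(slot, &iter)) {
		visited++;
		toy_delete(t, iter.index);	/* may free the chunk under slot */
		slot = toy_iter_reset(&iter);	/* drop stale pointer and cache */
	}
	return visited;
}
```

After the reset, `toy_next_slot()` sees an empty cached range and returns NULL without dereferencing anything, so the loop condition falls through to `toy_next_chunk()` and the walk resumes from the tree's current state.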
Kirill A. Shutemov Nov. 14, 2016, 7:40 p.m. UTC | #11
On Mon, Nov 14, 2016 at 11:48:22AM -0500, Johannes Weiner wrote:
> On Mon, Nov 14, 2016 at 10:52:50AM -0500, Johannes Weiner wrote:
> > On Mon, Nov 14, 2016 at 05:29:02PM +0300, Kirill A. Shutemov wrote:
> > > @@ -1400,7 +1400,9 @@ static void collapse_shmem(struct mm_struct *mm,
> > >  					PAGE_SIZE, 0);
> > >  
> > >  		spin_lock_irq(&mapping->tree_lock);
> > > -
> > > +		slot = radix_tree_lookup_slot(&mapping->page_tree, index);
> > > +		VM_BUG_ON_PAGE(page != radix_tree_deref_slot_protected(slot,
> > > +					&mapping->tree_lock), page);
> > >  		VM_BUG_ON_PAGE(page_mapped(page), page);
> > 
> > That looks good to me. The slot may get relocated, but the content
> > shouldn't change with the page locked.
> > 
> > Are you going to send a full patch with changelog and sign-off? If so,
> > please add:
> > 
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Just to clarify, this is in addition to my radix_tree_iter_next()
> change. The iterator still needs to be reloaded because the number of
> valid slots that come after the current one can change as well.

Could you just amend all these fixups into your patch?
Johannes Weiner Nov. 15, 2016, 2 p.m. UTC | #12
On Mon, Nov 14, 2016 at 10:40:54PM +0300, Kirill A. Shutemov wrote:
> Could you just amend all these fixups into your patch?

Will do.

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 728d7790dc2d..eac6f0580e26 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1520,7 +1520,8 @@  static void collapse_shmem(struct mm_struct *mm,
 				if (!nr_none)
 					break;
 				/* Put holes back where they were */
-				radix_tree_replace_slot(slot, NULL);
+				radix_tree_delete(&mapping->page_tree,
+						  iter.index);
 				nr_none--;
 				continue;
 			}