nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* dax_lock_mapping_entry was never safe
@ 2018-11-26 16:12 Matthew Wilcox
  2018-11-26 17:11 ` Jan Kara
  0 siblings, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2018-11-26 16:12 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-fsdevel, Jan Kara, linux-nvdimm


I noticed this path while I was doing the 4.19 backport of
dax: Avoid losing wakeup in dax_lock_mapping_entry

                xa_unlock_irq(&mapping->i_pages);
                revalidate = wait_fn();
                finish_wait(wq, &ewait.wait);
                xa_lock_irq(&mapping->i_pages);

It's not safe to call xa_lock_irq() if mapping can have been freed while
we slept.  We'll probably get away with it; most filesystems use a unique
slab for their inodes, so you'll likely get either a freed inode or an
inode which is now the wrong inode.  But if that page has been freed back
to the page allocator, that pointer could now be pointing at anything.

Fixing this in the current codebase is no easier than fixing it in the
4.19 codebase.  This is the best I've come up with.  Could we do better
by not using the _exclusive form of prepare_to_wait()?  I'm not familiar
with all the things that need to be considered when using this family
of interfaces.

diff --git a/fs/dax.c b/fs/dax.c
index 9bcce89ea18e..154b592b18eb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -232,6 +232,24 @@ static void *get_unlocked_entry(struct xa_state *xas)
 	}
 }
 
+static void wait_unlocked_entry(struct xa_state *xas, void *entry)
+{
+	struct wait_exceptional_entry_queue ewait;
+	wait_queue_head_t *wq;
+
+	init_wait(&ewait.wait);
+	ewait.wait.func = wake_exceptional_entry_func;
+
+	wq = dax_entry_waitqueue(xas, entry, &ewait.key);
+	prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
+	xas_unlock_irq(xas);
+	/* We can no longer look at xas */
+	schedule();
+	finish_wait(wq, &ewait.wait);
+	if (waitqueue_active(wq))
+		__wake_up(wq, TASK_NORMAL, 1, &ewait.key);
+}
+
 static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
 	/* If we were the only waiter woken, wake the next one */
@@ -389,9 +407,7 @@ bool dax_lock_mapping_entry(struct page *page)
 		entry = xas_load(&xas);
 		if (dax_is_locked(entry)) {
 			rcu_read_unlock();
-			entry = get_unlocked_entry(&xas);
-			xas_unlock_irq(&xas);
-			put_unlocked_entry(&xas, entry);
+			wait_unlocked_entry(&xas, entry);
 			rcu_read_lock();
 			continue;
 		}
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: dax_lock_mapping_entry was never safe
  2018-11-26 16:12 dax_lock_mapping_entry was never safe Matthew Wilcox
@ 2018-11-26 17:11 ` Jan Kara
  2018-11-26 20:36   ` Dan Williams
  0 siblings, 1 reply; 4+ messages in thread
From: Jan Kara @ 2018-11-26 17:11 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, Jan Kara, linux-nvdimm

On Mon 26-11-18 08:12:40, Matthew Wilcox wrote:
> 
> I noticed this path while I was doing the 4.19 backport of
> dax: Avoid losing wakeup in dax_lock_mapping_entry
> 
>                 xa_unlock_irq(&mapping->i_pages);
>                 revalidate = wait_fn();
>                 finish_wait(wq, &ewait.wait);
>                 xa_lock_irq(&mapping->i_pages);

I guess this is a snippet from get_unlocked_entry(), isn't it?

> It's not safe to call xa_lock_irq() if mapping can have been freed while
> we slept.  We'll probably get away with it; most filesystems use a unique
> slab for their inodes, so you'll likely get either a freed inode or an
> inode which is now the wrong inode.  But if that page has been freed back
> to the page allocator, that pointer could now be pointing at anything.

Correct. Thanks for catching this bug!

> Fixing this in the current codebase is no easier than fixing it in the
> 4.19 codebase.  This is the best I've come up with.  Could we do better
> by not using the _exclusive form of prepare_to_wait()?  I'm not familiar
> with all the things that need to be considered when using this family
> of interfaces.
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 9bcce89ea18e..154b592b18eb 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -232,6 +232,24 @@ static void *get_unlocked_entry(struct xa_state *xas)
>  	}
>  }
>  
> +static void wait_unlocked_entry(struct xa_state *xas, void *entry)
> +{
> +	struct wait_exceptional_entry_queue ewait;
> +	wait_queue_head_t *wq;
> +
> +	init_wait(&ewait.wait);
> +	ewait.wait.func = wake_exceptional_entry_func;
> +
> +	wq = dax_entry_waitqueue(xas, entry, &ewait.key);
> +	prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
> +	xas_unlock_irq(xas);
> +	/* We can no longer look at xas */
> +	schedule();
> +	finish_wait(wq, &ewait.wait);
> +	if (waitqueue_active(wq))
> +		__wake_up(wq, TASK_NORMAL, 1, &ewait.key);
> +}
> +

The code looks good. Maybe can we call this wait_entry_unlocked() to stress
that entry is not really usable after this function returns? And comment
before the function that this is safe to call even if we don't have a
reference keeping mapping alive?

>  static void put_unlocked_entry(struct xa_state *xas, void *entry)
>  {
>  	/* If we were the only waiter woken, wake the next one */
> @@ -389,9 +407,7 @@ bool dax_lock_mapping_entry(struct page *page)
>  		entry = xas_load(&xas);
>  		if (dax_is_locked(entry)) {
>  			rcu_read_unlock();
> -			entry = get_unlocked_entry(&xas);
> -			xas_unlock_irq(&xas);
> -			put_unlocked_entry(&xas, entry);
> +			wait_unlocked_entry(&xas, entry);
>  			rcu_read_lock();
>  			continue;

The continue here actually is not safe either because if the mapping got
freed, page->mapping will be NULL and we oops at the beginning of the loop.
So that !dax_mapping() check should also check for mapping != NULL.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: dax_lock_mapping_entry was never safe
  2018-11-26 17:11 ` Jan Kara
@ 2018-11-26 20:36   ` Dan Williams
  2018-11-27 18:59     ` Matthew Wilcox
  0 siblings, 1 reply; 4+ messages in thread
From: Dan Williams @ 2018-11-26 20:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Wilcox, linux-fsdevel, linux-nvdimm, Dave Jiang

On Mon, Nov 26, 2018 at 9:11 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 26-11-18 08:12:40, Matthew Wilcox wrote:
> >
> > I noticed this path while I was doing the 4.19 backport of
> > dax: Avoid losing wakeup in dax_lock_mapping_entry
> >
> >                 xa_unlock_irq(&mapping->i_pages);
> >                 revalidate = wait_fn();
> >                 finish_wait(wq, &ewait.wait);
> >                 xa_lock_irq(&mapping->i_pages);
>
> I guess this is a snippet from get_unlocked_entry(), isn't it?
>
> > It's not safe to call xa_lock_irq() if mapping can have been freed while
> > we slept.  We'll probably get away with it; most filesystems use a unique
> > slab for their inodes, so you'll likely get either a freed inode or an
> > inode which is now the wrong inode.  But if that page has been freed back
> > to the page allocator, that pointer could now be pointing at anything.
>
> Correct. Thanks for catching this bug!

Yes, nice catch!

>
> > Fixing this in the current codebase is no easier than fixing it in the
> > 4.19 codebase.  This is the best I've come up with.  Could we do better
> > by not using the _exclusive form of prepare_to_wait()?  I'm not familiar
> > with all the things that need to be considered when using this family
> > of interfaces.
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 9bcce89ea18e..154b592b18eb 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -232,6 +232,24 @@ static void *get_unlocked_entry(struct xa_state *xas)
> >       }
> >  }
> >
> > +static void wait_unlocked_entry(struct xa_state *xas, void *entry)
> > +{
> > +     struct wait_exceptional_entry_queue ewait;
> > +     wait_queue_head_t *wq;
> > +
> > +     init_wait(&ewait.wait);
> > +     ewait.wait.func = wake_exceptional_entry_func;
> > +
> > +     wq = dax_entry_waitqueue(xas, entry, &ewait.key);
> > +     prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
> > +     xas_unlock_irq(xas);
> > +     /* We can no longer look at xas */
> > +     schedule();
> > +     finish_wait(wq, &ewait.wait);
> > +     if (waitqueue_active(wq))
> > +             __wake_up(wq, TASK_NORMAL, 1, &ewait.key);
> > +}
> > +
>
> The code looks good. Maybe can we call this wait_entry_unlocked() to stress
> that entry is not really usable after this function returns? And comment
> before the function that this is safe to call even if we don't have a
> reference keeping mapping alive?

Yes, maybe even something more ambiguous like "wait_entry_event()",
because there's no guarantee the entry is unlocked just that now is a
good time to try to interrogate the entry again.

>
> >  static void put_unlocked_entry(struct xa_state *xas, void *entry)
> >  {
> >       /* If we were the only waiter woken, wake the next one */
> > @@ -389,9 +407,7 @@ bool dax_lock_mapping_entry(struct page *page)
> >               entry = xas_load(&xas);
> >               if (dax_is_locked(entry)) {
> >                       rcu_read_unlock();
> > -                     entry = get_unlocked_entry(&xas);
> > -                     xas_unlock_irq(&xas);
> > -                     put_unlocked_entry(&xas, entry);
> > +                     wait_unlocked_entry(&xas, entry);
> >                       rcu_read_lock();
> >                       continue;
>
> The continue here actually is not safe either because if the mapping got
> freed, page->mapping will be NULL and we oops at the beginning of the loop.
> So that !dax_mapping() check should also check for mapping != NULL.

Yes.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: dax_lock_mapping_entry was never safe
  2018-11-26 20:36   ` Dan Williams
@ 2018-11-27 18:59     ` Matthew Wilcox
  0 siblings, 0 replies; 4+ messages in thread
From: Matthew Wilcox @ 2018-11-27 18:59 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-fsdevel, Jan Kara, linux-nvdimm

On Mon, Nov 26, 2018 at 12:36:26PM -0800, Dan Williams wrote:
> On Mon, Nov 26, 2018 at 9:11 AM Jan Kara <jack@suse.cz> wrote:
> > The code looks good. Maybe can we call this wait_entry_unlocked() to stress
> > that entry is not really usable after this function returns? And comment
> > before the function that this is safe to call even if we don't have a
> > reference keeping mapping alive?
> 
> Yes, maybe even something more ambiguous like "wait_entry_event()",
> because there's no guarantee the entry is unlocked just that now is a
> good time to try to interrogate the entry again.

It _became_ unlocked ... it might be locked again, or have disappeared
by the time we get to it, but somebody called dax_wake_entry() for this
exact entry.  I mean, we could call it wait_entry_wake(), but I think
that's a little generic.  I'm going with Jan's version ;-)

> > The continue here actually is not safe either because if the mapping got
> > freed, page->mapping will be NULL and we oops at the beginning of the loop.
> > So that !dax_mapping() check should also check for mapping != NULL.
> 
> Yes.

Sigh.  I'll make that a separate patch so it applies cleanly to 4.19.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-11-27 18:59 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-26 16:12 dax_lock_mapping_entry was never safe Matthew Wilcox
2018-11-26 17:11 ` Jan Kara
2018-11-26 20:36   ` Dan Williams
2018-11-27 18:59     ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).