orangefs fixes

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* orangefs fixes
@ 2020-03-26 17:07 Christoph Hellwig
  2020-03-26 17:07 ` [PATCH 1/2] Revert "orangefs: remember count when reading." Christoph Hellwig
  2020-03-26 17:07 ` [PATCH 2/2] orangefs: don't mess with I_DIRTY_TIMES in orangefs_flush Christoph Hellwig
  0 siblings, 2 replies; 7+ messages in thread
From: Christoph Hellwig @ 2020-03-26 17:07 UTC (permalink / raw)
  To: Mike Marshall, Martin Brandenburg; +Cc: devel, Alexander Viro, linux-fsdevel

Hi Mike and Martin,

this fixes a pretty grave bug in the ->read_iter and ->flush interaction,
and also removes some copy and pasted code that isn't needed (and out of
date already)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] Revert "orangefs: remember count when reading."
  2020-03-26 17:07 orangefs fixes Christoph Hellwig
@ 2020-03-26 17:07 ` Christoph Hellwig
  2020-03-31 13:55   ` Mike Marshall
  2020-04-04 16:28   ` [PATCH] orangefs: complete Christoph's "remember count" reversion hubcap
  2020-03-26 17:07 ` [PATCH 2/2] orangefs: don't mess with I_DIRTY_TIMES in orangefs_flush Christoph Hellwig
  1 sibling, 2 replies; 7+ messages in thread
From: Christoph Hellwig @ 2020-03-26 17:07 UTC (permalink / raw)
  To: Mike Marshall, Martin Brandenburg; +Cc: devel, Alexander Viro, linux-fsdevel

->read_iter calls can race with each other and one or more ->flush calls.
Remove the the scheme to store the read count in the file private data
as is is completely racy and can cause use after free or double free
conditions.

This reverts commit c2549f8c7a28c00facaf911f700c4811cfd6f52b.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/orangefs/file.c            | 26 +-------------------------
 fs/orangefs/orangefs-kernel.h |  4 ----
 2 files changed, 1 insertion(+), 29 deletions(-)

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index c740159d9ad1..173e6ea57a47 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -346,23 +346,8 @@ static ssize_t orangefs_file_read_iter(struct kiocb *iocb,
     struct iov_iter *iter)
 {
 	int ret;
-	struct orangefs_read_options *ro;
-
 	orangefs_stats.reads++;
 
-	/*
-	 * Remember how they set "count" in read(2) or pread(2) or whatever -
-	 * users can use count as a knob to control orangefs io size and later
-	 * we can try to help them fill as many pages as possible in readpage.
-	 */
-	if (!iocb->ki_filp->private_data) {
-		iocb->ki_filp->private_data = kmalloc(sizeof *ro, GFP_KERNEL);
-		if (!iocb->ki_filp->private_data)
-			return(ENOMEM);
-		ro = iocb->ki_filp->private_data;
-		ro->blksiz = iter->count;
-	}
-
 	down_read(&file_inode(iocb->ki_filp)->i_rwsem);
 	ret = orangefs_revalidate_mapping(file_inode(iocb->ki_filp));
 	if (ret)
@@ -650,12 +635,6 @@ static int orangefs_lock(struct file *filp, int cmd, struct file_lock *fl)
 	return rc;
 }
 
-static int orangefs_file_open(struct inode * inode, struct file *file)
-{
-	file->private_data = NULL;
-	return generic_file_open(inode, file);
-}
-
 static int orangefs_flush(struct file *file, fl_owner_t id)
 {
 	/*
@@ -669,9 +648,6 @@ static int orangefs_flush(struct file *file, fl_owner_t id)
 	struct inode *inode = file->f_mapping->host;
 	int r;
 
-	kfree(file->private_data);
-	file->private_data = NULL;
-
 	if (inode->i_state & I_DIRTY_TIME) {
 		spin_lock(&inode->i_lock);
 		inode->i_state &= ~I_DIRTY_TIME;
@@ -694,7 +670,7 @@ const struct file_operations orangefs_file_operations = {
 	.lock		= orangefs_lock,
 	.unlocked_ioctl	= orangefs_ioctl,
 	.mmap		= orangefs_file_mmap,
-	.open		= orangefs_file_open,
+	.open		= generic_file_open,
 	.flush		= orangefs_flush,
 	.release	= orangefs_file_release,
 	.fsync		= orangefs_fsync,
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index ed67f39fa7ce..e12aeb9623d6 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -239,10 +239,6 @@ struct orangefs_write_range {
 	kgid_t gid;
 };
 
-struct orangefs_read_options {
-	ssize_t blksiz;
-};
-
 extern struct orangefs_stats orangefs_stats;
 
 /*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] Revert "orangefs: remember count when reading."
  2020-03-26 17:07 ` [PATCH 1/2] Revert "orangefs: remember count when reading." Christoph Hellwig
@ 2020-03-31 13:55   ` Mike Marshall
  2020-04-04 16:28   ` [PATCH] orangefs: complete Christoph's "remember count" reversion hubcap
  1 sibling, 0 replies; 7+ messages in thread
From: Mike Marshall @ 2020-03-31 13:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin Brandenburg, devel, Alexander Viro, linux-fsdevel

Hi Christoph...

Thanks for the patches... the revision of c2549f8c (which I dreamed up :-) )
might hose up some other stuff with our recent page cache revisions,
though I have no doubt about your concerns... I'll post back after I've
run some tests...

-Mike

On Thu, Mar 26, 2020 at 1:07 PM Christoph Hellwig <hch@lst.de> wrote:
>
> ->read_iter calls can race with each other and one or more ->flush calls.
> Remove the the scheme to store the read count in the file private data
> as is is completely racy and can cause use after free or double free
> conditions.
>
> This reverts commit c2549f8c7a28c00facaf911f700c4811cfd6f52b.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/orangefs/file.c            | 26 +-------------------------
>  fs/orangefs/orangefs-kernel.h |  4 ----
>  2 files changed, 1 insertion(+), 29 deletions(-)
>
> diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
> index c740159d9ad1..173e6ea57a47 100644
> --- a/fs/orangefs/file.c
> +++ b/fs/orangefs/file.c
> @@ -346,23 +346,8 @@ static ssize_t orangefs_file_read_iter(struct kiocb *iocb,
>      struct iov_iter *iter)
>  {
>         int ret;
> -       struct orangefs_read_options *ro;
> -
>         orangefs_stats.reads++;
>
> -       /*
> -        * Remember how they set "count" in read(2) or pread(2) or whatever -
> -        * users can use count as a knob to control orangefs io size and later
> -        * we can try to help them fill as many pages as possible in readpage.
> -        */
> -       if (!iocb->ki_filp->private_data) {
> -               iocb->ki_filp->private_data = kmalloc(sizeof *ro, GFP_KERNEL);
> -               if (!iocb->ki_filp->private_data)
> -                       return(ENOMEM);
> -               ro = iocb->ki_filp->private_data;
> -               ro->blksiz = iter->count;
> -       }
> -
>         down_read(&file_inode(iocb->ki_filp)->i_rwsem);
>         ret = orangefs_revalidate_mapping(file_inode(iocb->ki_filp));
>         if (ret)
> @@ -650,12 +635,6 @@ static int orangefs_lock(struct file *filp, int cmd, struct file_lock *fl)
>         return rc;
>  }
>
> -static int orangefs_file_open(struct inode * inode, struct file *file)
> -{
> -       file->private_data = NULL;
> -       return generic_file_open(inode, file);
> -}
> -
>  static int orangefs_flush(struct file *file, fl_owner_t id)
>  {
>         /*
> @@ -669,9 +648,6 @@ static int orangefs_flush(struct file *file, fl_owner_t id)
>         struct inode *inode = file->f_mapping->host;
>         int r;
>
> -       kfree(file->private_data);
> -       file->private_data = NULL;
> -
>         if (inode->i_state & I_DIRTY_TIME) {
>                 spin_lock(&inode->i_lock);
>                 inode->i_state &= ~I_DIRTY_TIME;
> @@ -694,7 +670,7 @@ const struct file_operations orangefs_file_operations = {
>         .lock           = orangefs_lock,
>         .unlocked_ioctl = orangefs_ioctl,
>         .mmap           = orangefs_file_mmap,
> -       .open           = orangefs_file_open,
> +       .open           = generic_file_open,
>         .flush          = orangefs_flush,
>         .release        = orangefs_file_release,
>         .fsync          = orangefs_fsync,
> diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
> index ed67f39fa7ce..e12aeb9623d6 100644
> --- a/fs/orangefs/orangefs-kernel.h
> +++ b/fs/orangefs/orangefs-kernel.h
> @@ -239,10 +239,6 @@ struct orangefs_write_range {
>         kgid_t gid;
>  };
>
> -struct orangefs_read_options {
> -       ssize_t blksiz;
> -};
> -
>  extern struct orangefs_stats orangefs_stats;
>
>  /*
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] orangefs: complete Christoph's "remember count" reversion.
  2020-03-26 17:07 ` [PATCH 1/2] Revert "orangefs: remember count when reading." Christoph Hellwig
  2020-03-31 13:55   ` Mike Marshall
@ 2020-04-04 16:28   ` hubcap
  2020-04-04 17:43     ` Matthew Wilcox
  1 sibling, 1 reply; 7+ messages in thread
From: hubcap @ 2020-04-04 16:28 UTC (permalink / raw)
  To: hch; +Cc: Mike Marshall, devel, linux-fsdevel

From: Mike Marshall <hubcap@omnibond.com>

---------- added text for this mail message ----------------
Here's a related patch. Something like this needs to be added
to Christoph's reversion in order for Orangefs to compile and
keep working. Using this static read size rather than using
the reverted logic to determine a read size seems faster in
my tests anyhow.

I accept Christoph's assertion that there would be a race, and
I looked at some of the vfs code (vfs_read, new_sync_read and
related). I guess the race Christoph sees would happen in
a threaded userspace program? I would be a better maintainer
if I saw the race more clearly, if anyone wants to go into
it. 

I guess I could employ a locking scheme if I wanted keep the
reverted code (I don't) instead of getting rid of it?

As an aside, the page cache has been a blessing and a curse
for us. Since we started using it, small IO has improved
incredibly, but our max speed hits a plateau before it otherwise
would have. I think because of all the page size copies we have
to do to fill our 4 meg native buffers. I try to read about all
the new work going into the page cache in lwn, and make some
sense of the new code :-). One thing I remember is when
Christoph Lameter said "the page cache does not scale", but
I know the new work is focused on that. If anyone has any
thoughts about how we could make improvments on filling our
native buffers from the page cache (larger page sizes?),
feel free to offer any help...

Anywho... thanks and I'll try to get this patch and
Christoph's two pulled if nobody sees a problem with them...

-Mike
------------------------------------------------------------

Logically, optimal Orangefs "pages" are 4 megabytes. Reading large Orangefs
files 4096 bytes at a time is like trying to kick a dead whale
down the beach. Before Christoph's "Revert orangefs: remember
count when reading." I tried to give users a knob whereby they could,
for example, use "count" in read(2) or bs with dd(1) to get
whatever they considered an appropriate amount of bytes at
a time from Orangefs and fill as many page cache pages as they
could at once.

Without the racy code that Christoph reverted Orangefs won't
even compile, much less work. So this replaces the logic that
used the private file data that Christoph reverted with
a static number of bytes to read from Orangefs.

I ran tests like the following to determine what a
reasonable static number of bytes might be:

dd if=/pvfsmnt/asdf of=/dev/null count=128 bs=4194304
dd if=/pvfsmnt/asdf of=/dev/null count=256 bs=2097152
dd if=/pvfsmnt/asdf of=/dev/null count=512 bs=1048576
                          .
                          .
                          .
dd if=/pvfsmnt/asdf of=/dev/null count=4194304 bs=128

Reads seem faster using the static number, so my "knob code"
wasn't just racy, it wasn't even a good idea...

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
---
 fs/orangefs/inode.c | 39 ++++++---------------------------------
 1 file changed, 6 insertions(+), 33 deletions(-)

diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 961c0fd8675a..fb0884626d18 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -259,46 +259,19 @@ static int orangefs_readpage(struct file *file, struct page *page)
 	pgoff_t index; /* which page */
 	struct page *next_page;
 	char *kaddr;
-	struct orangefs_read_options *ro = file->private_data;
 	loff_t read_size;
-	loff_t roundedup;
 	int buffer_index = -1; /* orangefs shared memory slot */
 	int slot_index;   /* index into slot */
 	int remaining;

 	/*
-	 * If they set some miniscule size for "count" in read(2)
-	 * (for example) then let's try to read a page, or the whole file
-	 * if it is smaller than a page. Once "count" goes over a page
-	 * then lets round up to the highest page size multiple that is
-	 * less than or equal to "count" and do that much orangefs IO and
-	 * try to fill as many pages as we can from it.
-	 *
-	 * "count" should be represented in ro->blksiz.
-	 *
-	 * inode->i_size = file size.
+	 * Get up to this many bytes from Orangefs at a time and try
+	 * to fill them into the page cache at once.
+	 * Tests with dd made this seem like a reasonable static
+	 * number, if there was interest perhaps this number could
+	 * be made setable through sysfs...
 	 */
-	if (ro) {
-		if (ro->blksiz < PAGE_SIZE) {
-			if (inode->i_size < PAGE_SIZE)
-				read_size = inode->i_size;
-			else
-				read_size = PAGE_SIZE;
-		} else {
-			roundedup = ((PAGE_SIZE - 1) & ro->blksiz) ?
-				((ro->blksiz + PAGE_SIZE) & ~(PAGE_SIZE -1)) :
-				ro->blksiz;
-			if (roundedup > inode->i_size)
-				read_size = inode->i_size;
-			else
-				read_size = roundedup;
-
-		}
-	} else {
-		read_size = PAGE_SIZE;
-	}
-	if (!read_size)
-		read_size = PAGE_SIZE;
+	read_size = 524288;

 	if (PageDirty(page))
 		orangefs_launder_page(page);
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] orangefs: complete Christoph's "remember count" reversion.
  2020-04-04 16:28   ` [PATCH] orangefs: complete Christoph's "remember count" reversion hubcap
@ 2020-04-04 17:43     ` Matthew Wilcox
  2020-04-04 20:57       ` Mike Marshall
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2020-04-04 17:43 UTC (permalink / raw)
  To: hubcap; +Cc: hch, Mike Marshall, devel, linux-fsdevel

On Sat, Apr 04, 2020 at 12:28:26PM -0400, hubcap@kernel.org wrote:
> As an aside, the page cache has been a blessing and a curse
> for us. Since we started using it, small IO has improved
> incredibly, but our max speed hits a plateau before it otherwise
> would have. I think because of all the page size copies we have
> to do to fill our 4 meg native buffers. I try to read about all
> the new work going into the page cache in lwn, and make some
> sense of the new code :-). One thing I remember is when
> Christoph Lameter said "the page cache does not scale", but
> I know the new work is focused on that. If anyone has any
> thoughts about how we could make improvments on filling our
> native buffers from the page cache (larger page sizes?),
> feel free to offer any help...

Umm, 4MB native buffers are ... really big ;-)  I wasn't planning on
going past PMD_SIZE (ie 2MB on x86) for the readahead large pages,
but if a filesystem wants that, then I should change that plan.

What I was planning for, but don't quite have an implementation nailed
down for yet, is allowing filesystems to grow the readahead beyond that
wanted by the generic code.  Filesystems which implement compression
frequently want blocks in the 256kB size range.  It seems like OrangeFS
would fit with that scheme, as long as I don't put a limit on what the
filesystem asks for.

So yes, I think within the next year, you should be able to tell the
page cache to allocate 4MB pages.  You will still need a fallback path
for when memory is too fragmented to allocate new pages, but if you're
using 4MB pages, then hopefully we'll be able to reclaim a clean 4MB
pages from elsewhere in the page cache and supply you with a new one.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] orangefs: complete Christoph's "remember count" reversion.
  2020-04-04 17:43     ` Matthew Wilcox
@ 2020-04-04 20:57       ` Mike Marshall
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Marshall @ 2020-04-04 20:57 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: hubcap, Christoph Hellwig, devel, linux-fsdevel

Matthew>  So yes, I think within the next year, you should be
Matthew>  able to tell the page cache to allocate 4MB pages.

I can't find the ascii thumbs up emoji :-) ...

-Mike

On Sat, Apr 4, 2020 at 1:43 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Apr 04, 2020 at 12:28:26PM -0400, hubcap@kernel.org wrote:
> > As an aside, the page cache has been a blessing and a curse
> > for us. Since we started using it, small IO has improved
> > incredibly, but our max speed hits a plateau before it otherwise
> > would have. I think because of all the page size copies we have
> > to do to fill our 4 meg native buffers. I try to read about all
> > the new work going into the page cache in lwn, and make some
> > sense of the new code :-). One thing I remember is when
> > Christoph Lameter said "the page cache does not scale", but
> > I know the new work is focused on that. If anyone has any
> > thoughts about how we could make improvments on filling our
> > native buffers from the page cache (larger page sizes?),
> > feel free to offer any help...
>
> Umm, 4MB native buffers are ... really big ;-)  I wasn't planning on
> going past PMD_SIZE (ie 2MB on x86) for the readahead large pages,
> but if a filesystem wants that, then I should change that plan.
>
> What I was planning for, but don't quite have an implementation nailed
> down for yet, is allowing filesystems to grow the readahead beyond that
> wanted by the generic code.  Filesystems which implement compression
> frequently want blocks in the 256kB size range.  It seems like OrangeFS
> would fit with that scheme, as long as I don't put a limit on what the
> filesystem asks for.
>
> So yes, I think within the next year, you should be able to tell the
> page cache to allocate 4MB pages.  You will still need a fallback path
> for when memory is too fragmented to allocate new pages, but if you're
> using 4MB pages, then hopefully we'll be able to reclaim a clean 4MB
> pages from elsewhere in the page cache and supply you with a new one.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 2/2] orangefs: don't mess with I_DIRTY_TIMES in orangefs_flush
  2020-03-26 17:07 orangefs fixes Christoph Hellwig
  2020-03-26 17:07 ` [PATCH 1/2] Revert "orangefs: remember count when reading." Christoph Hellwig
@ 2020-03-26 17:07 ` Christoph Hellwig
  1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2020-03-26 17:07 UTC (permalink / raw)
  To: Mike Marshall, Martin Brandenburg; +Cc: devel, Alexander Viro, linux-fsdevel

orangefs_flush just writes out data on every close(2) call.  There is no
need to change anything about the dirty state, especially as orangefs
doesn't treat I_DIRTY_TIMES special in any way.  The code seems to
come from partially open coding vfs_fsync.

Fixes: 90fc07065a35 ("orangefs: avoid fsync service operation on flush")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/orangefs/file.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index 173e6ea57a47..af375e049aae 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -645,16 +645,8 @@ static int orangefs_flush(struct file *file, fl_owner_t id)
 	 * on an explicit fsync call.  This duplicates historical OrangeFS
 	 * behavior.
 	 */
-	struct inode *inode = file->f_mapping->host;
 	int r;
 
-	if (inode->i_state & I_DIRTY_TIME) {
-		spin_lock(&inode->i_lock);
-		inode->i_state &= ~I_DIRTY_TIME;
-		spin_unlock(&inode->i_lock);
-		mark_inode_dirty_sync(inode);
-	}
-
 	r = filemap_write_and_wait_range(file->f_mapping, 0, LLONG_MAX);
 	if (r > 0)
 		return 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-04-04 20:57 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-26 17:07 orangefs fixes Christoph Hellwig
2020-03-26 17:07 ` [PATCH 1/2] Revert "orangefs: remember count when reading." Christoph Hellwig
2020-03-31 13:55   ` Mike Marshall
2020-04-04 16:28   ` [PATCH] orangefs: complete Christoph's "remember count" reversion hubcap
2020-04-04 17:43     ` Matthew Wilcox
2020-04-04 20:57       ` Mike Marshall
2020-03-26 17:07 ` [PATCH 2/2] orangefs: don't mess with I_DIRTY_TIMES in orangefs_flush Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).