All of lore.kernel.org
 help / color / mirror / Atom feed
* Uninitialized extent races
@ 2012-12-21  1:25 Jan Kara
  2012-12-21  3:11 ` Theodore Ts'o
  2012-12-21 12:34 ` Dmitry Monakhov
  0 siblings, 2 replies; 12+ messages in thread
From: Jan Kara @ 2012-12-21  1:25 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4

  Hello,

  I was looking into handling of unwritten extents and I've noticed two
things:
1) Commit 60d4616f added ext4_flush_unwritten_io() call into
   ext4_fallocate() with a comment that it should avoid a race between
   AIO DIO and fallocate. But that really doesn't completely fix the issue
   because writeback can create unwritten extents as well and that is not
   synchronized with i_mutex unlike AIO DIO.

2) Commit dee1f973 actually changed ext4_convert_unwritten_extents_endio()
   to be able to deal with the situation when extent we want to convert
   isn't exactly the one in the extent tree (i.e., it was merged with
   another extent in the mean time). That was the situation 1) tried to
   prevent so it seems to make ext4_flush_unwritten_io() call from 1)
   unnecessary?

Am I missing something Dmitry? Also I was wondering about one thing: Does
anybody see a problem with disabling merging of uninitialized extents
completely? It would simplify the code (end_io conversion doesn't need to
potentially split extents) and the case when we really want to merge
extents - i.e., when someone calls fallocate() on small chunks - doesn't
seem like the case we need to optimize for? Also it would bound the amount of
transaction credits we need for conversion to 1 block which would make it
easier for me to change ext4 to clear PageWriteback only after extent
conversion is done (again code simplification, more uniform handling of
page writeback).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Uninitialized extent races
  2012-12-21  1:25 Uninitialized extent races Jan Kara
@ 2012-12-21  3:11 ` Theodore Ts'o
  2012-12-21 16:19   ` Jan Kara
  2012-12-21 12:34 ` Dmitry Monakhov
  1 sibling, 1 reply; 12+ messages in thread
From: Theodore Ts'o @ 2012-12-21  3:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dmitry Monakhov, linux-ext4

On Fri, Dec 21, 2012 at 02:25:26AM +0100, Jan Kara wrote:
> Am I missing something Dmitry? Also I was wondering about one thing: Does
> anybody see a problem with disabling merging of uninitialized extents
> completely? It would simplify the code (end_io conversion doesn't need to
> potentially split extents) and the case when we really want to merge
> extents - i.e., when someone calls fallocate() on small chunks - doesn't
> seem like the case we need to optimize for? 

Which case specifically are you talking about here?

Are you talking about the merging of _formerly_ uninitialized extents?
i.e., what keeps the extent tree from exploding if you fallocate one
megabyte region, and then write to all 256 blocks of that one megabyte
region, except in a random order?

Or something else?

> Also it would bound the amount of transaction credits we need for
> conversion to 1 block which would make it easier for me to change
> ext4 to clear PageWriteback only after extent conversion is done
> (again code simplification, more uniform handling of page
> writeback).

So I'm confused.  If it's the case that we're thinking about, we only
need a single transaction credit, because we're not currently merging
across adjacent interior extent tree blocks.

Can you be a bit more explicit about which case you're thinking about?
I do agree that the extent tree code is too complicated, but we also
have the problem that we probably been more merging, not less, since
we can already end up with a case where you start with a single extent
tree block after fallocating a gigabyte or two.  Then after writing
randomly into that gigabyte file using AIO, we can end up with a very
deep, spindly extent tree containing multiple interior extent tree
blocks, because we're not doing sufficient merging --- and in
particular, we currently have no way at all of decreasing the depth of
the extent tree.

						- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Uninitialized extent races
  2012-12-21  1:25 Uninitialized extent races Jan Kara
  2012-12-21  3:11 ` Theodore Ts'o
@ 2012-12-21 12:34 ` Dmitry Monakhov
  1 sibling, 0 replies; 12+ messages in thread
From: Dmitry Monakhov @ 2012-12-21 12:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, 21 Dec 2012 02:25:26 +0100, Jan Kara <jack@suse.cz> wrote:
>   Hello,
> 
>   I was looking into handling of unwritten extents and I've noticed two
> things:
> 1) Commit 60d4616f added ext4_flush_unwritten_io() call into
>    ext4_fallocate() with a comment that it should avoid a race between
>    AIO DIO and fallocate. But that really doesn't completely fix the issue
>    because writeback can create unwritten extents as well and that is not
>    synchronized with i_mutex unlike AIO DIO.
> 
> 2) Commit dee1f973 actually changed ext4_convert_unwritten_extents_endio()
>    to be able to deal with the situation when extent we want to convert
>    isn't exactly the one in the extent tree (i.e., it was merged with
>    another extent in the mean time). That was the situation 1) tried to
>    prevent so it seems to make ext4_flush_unwritten_io() call from 1)
>    unnecessary?
In fact first patch is still required because dee1f973 may result in
in-depth tree grow so ENOSPC is not unprobable which result in 
complain from ext4_end_io().
> 
> Am I missing something Dmitry? Also I was wondering about one thing: Does
> anybody see a problem with disabling merging of uninitialized extents
> completely?
Yeah, actually this is sound idea. Instead of fighting again weird races
it is reasonable to just prevent it, because merge uninitialized extents
is false optimization anyway. Later it will be merged after if extents
becomes initialized(which is right place for optimization).
> It would simplify the code (end_io conversion doesn't need to
> potentially split extents) and the case when we really want to merge
> extents - i.e., when someone calls fallocate() on small chunks - doesn't
> seem like the case we need to optimize for? Also it would bound the amount of
> transaction credits we need for conversion to 1 block which would make it
> easier for me to change ext4 to clear PageWriteback only after extent
> conversion is done (again code simplification, more uniform handling of
> page writeback).
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Uninitialized extent races
  2012-12-21  3:11 ` Theodore Ts'o
@ 2012-12-21 16:19   ` Jan Kara
  2012-12-21 18:02     ` Theodore Ts'o
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2012-12-21 16:19 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, Dmitry Monakhov, linux-ext4

On Thu 20-12-12 22:11:51, Ted Tso wrote:
> On Fri, Dec 21, 2012 at 02:25:26AM +0100, Jan Kara wrote:
> > Am I missing something Dmitry? Also I was wondering about one thing: Does
> > anybody see a problem with disabling merging of uninitialized extents
> > completely? It would simplify the code (end_io conversion doesn't need to
> > potentially split extents) and the case when we really want to merge
> > extents - i.e., when someone calls fallocate() on small chunks - doesn't
> > seem like the case we need to optimize for? 
> 
> Which case specifically are you talking about here?
> 
> Are you talking about the merging of _formerly_ uninitialized extents?
> i.e., what keeps the extent tree from exploding if you fallocate one
> megabyte region, and then write to all 256 blocks of that one megabyte
> region, except in a random order?
> 
> Or something else?
  No, I'm speaking about merging currently uninitialized extents. I.e.
suppose someone does the following on a filesystem with dioread_nolock so
that writeback happens via unwritten extents:
  fd = open("file", O_RDWR);
  pwrite(fd, buf, 4096, 0);
					flusher thread starts writing
					we create uninitialized extent for
					  range 0-4096
  fallocate(fd, 0, 4096, 4096);
    - we merge extents and now have just 1 uninitialized extent for range
      0-8192
					ext4_convert_unwritten_extents() now
					  has to split the extent to finish
					  the IO.

Now splitting the extent requires number of credits proportional to the
tree depth, maybe even allocation... And strictly speaking number of
credits is impossible to reliably estimate until you hold i_data_sem (tree
can grow until we hold that semaphore) which is too late - we need to start
a transaction before we take that semaphore.

So if we disabled merging of extents that are currently uninitialized,
above problem couldn't happen. We would know we only convert that one
extent and possibly merge it in the leaf to other extents.

> > Also it would bound the amount of transaction credits we need for
> > conversion to 1 block which would make it easier for me to change
> > ext4 to clear PageWriteback only after extent conversion is done
> > (again code simplification, more uniform handling of page
> > writeback).
> 
> So I'm confused.  If it's the case that we're thinking about, we only
> need a single transaction credit, because we're not currently merging
> across adjacent interior extent tree blocks.
> 
> Can you be a bit more explicit about which case you're thinking about?
> I do agree that the extent tree code is too complicated, but we also
> have the problem that we probably been more merging, not less, since
> we can already end up with a case where you start with a single extent
> tree block after fallocating a gigabyte or two.  Then after writing
> randomly into that gigabyte file using AIO, we can end up with a very
> deep, spindly extent tree containing multiple interior extent tree
> blocks, because we're not doing sufficient merging --- and in
> particular, we currently have no way at all of decreasing the depth of
> the extent tree.
  Yeah, so I agree that's a problem in some cases but my suggestion
shouldn't really make this any worse. We *will* still be merging normal
extents after they are converted in end_io handler... Just we won't merge
them while they are still uninitialized.

And I regarding more merging, that could be done (obviously), just we might
need to postpone that after writeback is finished (PageWriteback is
cleared) because there extent estimates are not clear. And I need to know
necessary number of extents well in advance to be able to reserve credits
in the journal. OTOH maybe we could use jbd2_journal_extend() to get more
credits if we need them for merging. And when that fails, bad luck but we
can cope... Anyway, this is a different problem.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Uninitialized extent races
  2012-12-21 16:19   ` Jan Kara
@ 2012-12-21 18:02     ` Theodore Ts'o
  2012-12-21 22:49       ` Jan Kara
  2012-12-24 11:17       ` Zheng Liu
  0 siblings, 2 replies; 12+ messages in thread
From: Theodore Ts'o @ 2012-12-21 18:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dmitry Monakhov, linux-ext4

On Fri, Dec 21, 2012 at 05:19:29PM +0100, Jan Kara wrote:
>   No, I'm speaking about merging currently uninitialized extents. I.e.
> suppose someone does the following on a filesystem with dioread_nolock so
> that writeback happens via unwritten extents:
>   fd = open("file", O_RDWR);
>   pwrite(fd, buf, 4096, 0);
> 					flusher thread starts writing
> 					we create uninitialized extent for
> 					  range 0-4096
>   fallocate(fd, 0, 4096, 4096);
>     - we merge extents and now have just 1 uninitialized extent for range
>       0-8192
> 					ext4_convert_unwritten_extents() now
> 					  has to split the extent to finish
> 					  the IO.

Ah, I see.  Disabling the the merging that might take place as a
result of the fallocate.  Yes, I agree that's a completely sane thing
to do.

The alternate approach would be to add a flag in the extent status
tree indicating that an unwritten conversion is pending, but that
would add more complexity.

Hmmm.... do we need that complexity anyway?  What happens if we have a
race between a punch (or truncate) and the flusher thread, so there is
pending write.  There are two things that would be of concern.  (1)
Will convert_unwritten_extents do the right thing if the extent in
question has disappeared, and (2) what if the block gets reused for
some other inode in the interim?

I _think_ we're OK in the case of (2), since we're not using FUA
writes for anything other than the commit block, so there shouldn't be
any way that a write for the new inode could complete before the
pending write finishes up.  And (1) should be OK, although it may end
up triggering a WARN_ON and a scarry ext4_msg() in
ext4_convert_unwritten_extents().   But it made me stop and think....

> And I regarding more merging, that could be done (obviously), just we might
> need to postpone that after writeback is finished (PageWriteback is
> cleared) because there extent estimates are not clear. And I need to know
> necessary number of extents well in advance to be able to reserve credits
> in the journal. OTOH maybe we could use jbd2_journal_extend() to get more
> credits if we need them for merging. And when that fails, bad luck but we
> can cope... Anyway, this is a different problem.

Yeah, using jbd2_journal_extend() was what I was thinking about doing
where we could do some opportunistic merging if there's room in the
journal to allow that.  But I agree that's a different problem....

	   	 	      	    - Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Uninitialized extent races
  2012-12-21 18:02     ` Theodore Ts'o
@ 2012-12-21 22:49       ` Jan Kara
  2012-12-21 23:03         ` Theodore Ts'o
  2012-12-24 11:17       ` Zheng Liu
  1 sibling, 1 reply; 12+ messages in thread
From: Jan Kara @ 2012-12-21 22:49 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, Dmitry Monakhov, linux-ext4

On Fri 21-12-12 13:02:43, Ted Tso wrote:
> On Fri, Dec 21, 2012 at 05:19:29PM +0100, Jan Kara wrote:
> >   No, I'm speaking about merging currently uninitialized extents. I.e.
> > suppose someone does the following on a filesystem with dioread_nolock so
> > that writeback happens via unwritten extents:
> >   fd = open("file", O_RDWR);
> >   pwrite(fd, buf, 4096, 0);
> > 					flusher thread starts writing
> > 					we create uninitialized extent for
> > 					  range 0-4096
> >   fallocate(fd, 0, 4096, 4096);
> >     - we merge extents and now have just 1 uninitialized extent for range
> >       0-8192
> > 					ext4_convert_unwritten_extents() now
> > 					  has to split the extent to finish
> > 					  the IO.
> 
> Ah, I see.  Disabling the the merging that might take place as a
> result of the fallocate.  Yes, I agree that's a completely sane thing
> to do.
  OK, I'll write some patches.

> The alternate approach would be to add a flag in the extent status
> tree indicating that an unwritten conversion is pending, but that
> would add more complexity.
> 
> Hmmm.... do we need that complexity anyway?  What happens if we have a
> race between a punch (or truncate) and the flusher thread, so there is
> pending write.  There are two things that would be of concern.  (1)
> Will convert_unwritten_extents do the right thing if the extent in
> question has disappeared, and (2) what if the block gets reused for
> some other inode in the interim?
> 
> I _think_ we're OK in the case of (2), since we're not using FUA writes
> for anything other than the commit block, so there shouldn't be any way
> that a write for the new inode could complete before the pending write
> finishes up.  And (1) should be OK, although it may end up triggering a
> WARN_ON and a scarry ext4_msg() in ext4_convert_unwritten_extents().
> But it made me stop and think....
  It's actually simpler than that. We wait for any pending DIO using
inode_dio_wait() and i_mutex protects from new writes to be submitted. So
that takes care of one possibility. truncate_inode_pages() waits for
PageWriteback bit so that handles waiting for IO itself. After I change
ext4 to convert extents before clearing PageWriteback, this will take care
also of extent conversion. Now a call to ext4_flush_unwritten_io() in
ext4_ext_truncate() resolves the problems. It's called after invalidating
page cache so we know all the pending IO for the truncated / punched area
is finished, just a conversion may be still pending.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Uninitialized extent races
  2012-12-21 22:49       ` Jan Kara
@ 2012-12-21 23:03         ` Theodore Ts'o
  0 siblings, 0 replies; 12+ messages in thread
From: Theodore Ts'o @ 2012-12-21 23:03 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dmitry Monakhov, linux-ext4

On Fri, Dec 21, 2012 at 11:49:47PM +0100, Jan Kara wrote:
>   It's actually simpler than that. We wait for any pending DIO using
> inode_dio_wait() and i_mutex protects from new writes to be submitted. So
> that takes care of one possibility. truncate_inode_pages() waits for
> PageWriteback bit so that handles waiting for IO itself. 

Hmm, yes, I should have known/remembered that.  I've seen cases where
very rarely, it's possible for a unlink() or truncate() call to stall
for multiple minutes(!).  This can happen if you have writeback
happening in a container which has a very small (low priority)
constraint on its block I/O bandwidth.  If you try to delete an inode
which has writeback work pending, it's possible for the writeback to
take a looong time, which in turn causes the unlink to take a long
time.

It becomes worse the process doing the unlink is a high priority
process (say, the cluster management daemon who is cleaning up after
said low-priority job has completed), but the writeback is happening
in the context of a low priority cgroup.  You can end up with a nasty
priority inversion.

And there's not a lot we can do at the kernel level.  We could
dispatch the truncate to a workqueue and just make sure the file name
has disappeared from the file system name space before the unlink() to
userspace, but then the disk space gets released after the unlink()
call returns, which can cause other problems.

						- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Uninitialized extent races
  2012-12-21 18:02     ` Theodore Ts'o
  2012-12-21 22:49       ` Jan Kara
@ 2012-12-24 11:17       ` Zheng Liu
  2012-12-31  8:32         ` Jan Kara
  1 sibling, 1 reply; 12+ messages in thread
From: Zheng Liu @ 2012-12-24 11:17 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, Dmitry Monakhov, linux-ext4

On Fri, Dec 21, 2012 at 01:02:43PM -0500, Theodore Ts'o wrote:
> On Fri, Dec 21, 2012 at 05:19:29PM +0100, Jan Kara wrote:
> >   No, I'm speaking about merging currently uninitialized extents. I.e.
> > suppose someone does the following on a filesystem with dioread_nolock so
> > that writeback happens via unwritten extents:
> >   fd = open("file", O_RDWR);
> >   pwrite(fd, buf, 4096, 0);
> > 					flusher thread starts writing
> > 					we create uninitialized extent for
> > 					  range 0-4096
> >   fallocate(fd, 0, 4096, 4096);
> >     - we merge extents and now have just 1 uninitialized extent for range
> >       0-8192
> > 					ext4_convert_unwritten_extents() now
> > 					  has to split the extent to finish
> > 					  the IO.
> 
> Ah, I see.  Disabling the the merging that might take place as a
> result of the fallocate.  Yes, I agree that's a completely sane thing
> to do.
> 
> The alternate approach would be to add a flag in the extent status
> tree indicating that an unwritten conversion is pending, but that
> would add more complexity.

Hi Ted and Jan,

Sorry for delay reply.  Indeed we could add a flag in extent status tree
to indicate an pending unwritten extent, and I believe that it can bring
us some benefits.  But I wonder whether this case often happens.  Do we
have some real workloads?

Regards,
                                        - Zheng

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Uninitialized extent races
  2012-12-24 11:17       ` Zheng Liu
@ 2012-12-31  8:32         ` Jan Kara
  2012-12-31 16:31           ` Zheng Liu
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2012-12-31  8:32 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Theodore Ts'o, Jan Kara, Dmitry Monakhov, linux-ext4

On Mon 24-12-12 19:17:45, Zheng Liu wrote:
> On Fri, Dec 21, 2012 at 01:02:43PM -0500, Theodore Ts'o wrote:
> > On Fri, Dec 21, 2012 at 05:19:29PM +0100, Jan Kara wrote:
> > >   No, I'm speaking about merging currently uninitialized extents. I.e.
> > > suppose someone does the following on a filesystem with dioread_nolock so
> > > that writeback happens via unwritten extents:
> > >   fd = open("file", O_RDWR);
> > >   pwrite(fd, buf, 4096, 0);
> > > 					flusher thread starts writing
> > > 					we create uninitialized extent for
> > > 					  range 0-4096
> > >   fallocate(fd, 0, 4096, 4096);
> > >     - we merge extents and now have just 1 uninitialized extent for range
> > >       0-8192
> > > 					ext4_convert_unwritten_extents() now
> > > 					  has to split the extent to finish
> > > 					  the IO.
> > 
> > Ah, I see.  Disabling the the merging that might take place as a
> > result of the fallocate.  Yes, I agree that's a completely sane thing
> > to do.
> > 
> > The alternate approach would be to add a flag in the extent status
> > tree indicating that an unwritten conversion is pending, but that
> > would add more complexity.
> 
> Sorry for delay reply.  Indeed we could add a flag in extent status tree
> to indicate an pending unwritten extent, and I believe that it can bring
> us some benefits.  But I wonder whether this case often happens.  Do we
> have some real workloads?
  It doesn't happen often but it *can* happen. Thus you have to implement
a code which handles the case. I don't think bit in extent status tree is
really necessary. Just disabling merging of uninitialized extents is
simple. If we see there are some real workloads which have problems with
it, we can resort to a more complex solution using extent tree...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Uninitialized extent races
  2012-12-31  8:32         ` Jan Kara
@ 2012-12-31 16:31           ` Zheng Liu
  2012-12-31 16:44             ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Zheng Liu @ 2012-12-31 16:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, Dmitry Monakhov, linux-ext4

On Mon, Dec 31, 2012 at 09:32:21AM +0100, Jan Kara wrote:
> On Mon 24-12-12 19:17:45, Zheng Liu wrote:
> > On Fri, Dec 21, 2012 at 01:02:43PM -0500, Theodore Ts'o wrote:
> > > On Fri, Dec 21, 2012 at 05:19:29PM +0100, Jan Kara wrote:
> > > >   No, I'm speaking about merging currently uninitialized extents. I.e.
> > > > suppose someone does the following on a filesystem with dioread_nolock so
> > > > that writeback happens via unwritten extents:
> > > >   fd = open("file", O_RDWR);
> > > >   pwrite(fd, buf, 4096, 0);
> > > > 					flusher thread starts writing
> > > > 					we create uninitialized extent for
> > > > 					  range 0-4096
> > > >   fallocate(fd, 0, 4096, 4096);
> > > >     - we merge extents and now have just 1 uninitialized extent for range
> > > >       0-8192
> > > > 					ext4_convert_unwritten_extents() now
> > > > 					  has to split the extent to finish
> > > > 					  the IO.
> > > 
> > > Ah, I see.  Disabling the the merging that might take place as a
> > > result of the fallocate.  Yes, I agree that's a completely sane thing
> > > to do.
> > > 
> > > The alternate approach would be to add a flag in the extent status
> > > tree indicating that an unwritten conversion is pending, but that
> > > would add more complexity.
> > 
> > Sorry for delay reply.  Indeed we could add a flag in extent status tree
> > to indicate an pending unwritten extent, and I believe that it can bring
> > us some benefits.  But I wonder whether this case often happens.  Do we
> > have some real workloads?
>   It doesn't happen often but it *can* happen. Thus you have to implement
> a code which handles the case. I don't think bit in extent status tree is
> really necessary. Just disabling merging of uninitialized extents is
> simple. If we see there are some real workloads which have problems with
> it, we can resort to a more complex solution using extent tree...

Thanks for your explanation.  I don't know whether or not you have
generated a patch for this problem.  I am willing to make it in a proper
time.  If you have begun to generate it, please let me know. :-)

Happy New Year,
                                                - Zheng

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Uninitialized extent races
  2012-12-31 16:31           ` Zheng Liu
@ 2012-12-31 16:44             ` Jan Kara
  2013-01-01  4:49               ` Zheng Liu
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2012-12-31 16:44 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Jan Kara, Theodore Ts'o, Dmitry Monakhov, linux-ext4

On Tue 01-01-13 00:31:46, Zheng Liu wrote:
> On Mon, Dec 31, 2012 at 09:32:21AM +0100, Jan Kara wrote:
> > On Mon 24-12-12 19:17:45, Zheng Liu wrote:
> > > On Fri, Dec 21, 2012 at 01:02:43PM -0500, Theodore Ts'o wrote:
> > > > On Fri, Dec 21, 2012 at 05:19:29PM +0100, Jan Kara wrote:
> > > > >   No, I'm speaking about merging currently uninitialized extents. I.e.
> > > > > suppose someone does the following on a filesystem with dioread_nolock so
> > > > > that writeback happens via unwritten extents:
> > > > >   fd = open("file", O_RDWR);
> > > > >   pwrite(fd, buf, 4096, 0);
> > > > > 					flusher thread starts writing
> > > > > 					we create uninitialized extent for
> > > > > 					  range 0-4096
> > > > >   fallocate(fd, 0, 4096, 4096);
> > > > >     - we merge extents and now have just 1 uninitialized extent for range
> > > > >       0-8192
> > > > > 					ext4_convert_unwritten_extents() now
> > > > > 					  has to split the extent to finish
> > > > > 					  the IO.
> > > > 
> > > > Ah, I see.  Disabling the the merging that might take place as a
> > > > result of the fallocate.  Yes, I agree that's a completely sane thing
> > > > to do.
> > > > 
> > > > The alternate approach would be to add a flag in the extent status
> > > > tree indicating that an unwritten conversion is pending, but that
> > > > would add more complexity.
> > > 
> > > Sorry for delay reply.  Indeed we could add a flag in extent status tree
> > > to indicate an pending unwritten extent, and I believe that it can bring
> > > us some benefits.  But I wonder whether this case often happens.  Do we
> > > have some real workloads?
> >   It doesn't happen often but it *can* happen. Thus you have to implement
> > a code which handles the case. I don't think bit in extent status tree is
> > really necessary. Just disabling merging of uninitialized extents is
> > simple. If we see there are some real workloads which have problems with
> > it, we can resort to a more complex solution using extent tree...
> 
> Thanks for your explanation.  I don't know whether or not you have
> generated a patch for this problem.  I am willing to make it in a proper
> time.  If you have begun to generate it, please let me know. :-)
  Disabling the merging is trivial and I have a patch for that. Just making
all other changes so that Christoph's DIO patches can work is non-trivial.
I already have several smaller fixes and cleanups to make things easier but
writeback path still has locking issues - I have a solution in mind but
whether it will be needed or not depends on what I asked in the other email
- whether extent status tree can really be used or not...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Uninitialized extent races
  2012-12-31 16:44             ` Jan Kara
@ 2013-01-01  4:49               ` Zheng Liu
  0 siblings, 0 replies; 12+ messages in thread
From: Zheng Liu @ 2013-01-01  4:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Theodore Ts'o, Dmitry Monakhov, linux-ext4

On Mon, Dec 31, 2012 at 05:44:46PM +0100, Jan Kara wrote:
> On Tue 01-01-13 00:31:46, Zheng Liu wrote:
> > On Mon, Dec 31, 2012 at 09:32:21AM +0100, Jan Kara wrote:
> > > On Mon 24-12-12 19:17:45, Zheng Liu wrote:
> > > > On Fri, Dec 21, 2012 at 01:02:43PM -0500, Theodore Ts'o wrote:
> > > > > On Fri, Dec 21, 2012 at 05:19:29PM +0100, Jan Kara wrote:
> > > > > >   No, I'm speaking about merging currently uninitialized extents. I.e.
> > > > > > suppose someone does the following on a filesystem with dioread_nolock so
> > > > > > that writeback happens via unwritten extents:
> > > > > >   fd = open("file", O_RDWR);
> > > > > >   pwrite(fd, buf, 4096, 0);
> > > > > > 					flusher thread starts writing
> > > > > > 					we create uninitialized extent for
> > > > > > 					  range 0-4096
> > > > > >   fallocate(fd, 0, 4096, 4096);
> > > > > >     - we merge extents and now have just 1 uninitialized extent for range
> > > > > >       0-8192
> > > > > > 					ext4_convert_unwritten_extents() now
> > > > > > 					  has to split the extent to finish
> > > > > > 					  the IO.
> > > > > 
> > > > > Ah, I see.  Disabling the the merging that might take place as a
> > > > > result of the fallocate.  Yes, I agree that's a completely sane thing
> > > > > to do.
> > > > > 
> > > > > The alternate approach would be to add a flag in the extent status
> > > > > tree indicating that an unwritten conversion is pending, but that
> > > > > would add more complexity.
> > > > 
> > > > Sorry for delay reply.  Indeed we could add a flag in extent status tree
> > > > to indicate an pending unwritten extent, and I believe that it can bring
> > > > us some benefits.  But I wonder whether this case often happens.  Do we
> > > > have some real workloads?
> > >   It doesn't happen often but it *can* happen. Thus you have to implement
> > > a code which handles the case. I don't think bit in extent status tree is
> > > really necessary. Just disabling merging of uninitialized extents is
> > > simple. If we see there are some real workloads which have problems with
> > > it, we can resort to a more complex solution using extent tree...
> > 
> > Thanks for your explanation.  I don't know whether or not you have
> > generated a patch for this problem.  I am willing to make it in a proper
> > time.  If you have begun to generate it, please let me know. :-)
>   Disabling the merging is trivial and I have a patch for that. Just making
> all other changes so that Christoph's DIO patches can work is non-trivial.
> I already have several smaller fixes and cleanups to make things easier but
> writeback path still has locking issues - I have a solution in mind but
> whether it will be needed or not depends on what I asked in the other email
> - whether extent status tree can really be used or not...

Yeah, I have replied the mail in the other thread.  Thanks for your
time.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-01-01  4:36 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-21  1:25 Uninitialized extent races Jan Kara
2012-12-21  3:11 ` Theodore Ts'o
2012-12-21 16:19   ` Jan Kara
2012-12-21 18:02     ` Theodore Ts'o
2012-12-21 22:49       ` Jan Kara
2012-12-21 23:03         ` Theodore Ts'o
2012-12-24 11:17       ` Zheng Liu
2012-12-31  8:32         ` Jan Kara
2012-12-31 16:31           ` Zheng Liu
2012-12-31 16:44             ` Jan Kara
2013-01-01  4:49               ` Zheng Liu
2012-12-21 12:34 ` Dmitry Monakhov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.