linux-fsdevel.vger.kernel.org archive mirror
* Device removal crash problems
@ 2016-06-13  1:26 Anton Altaparmakov
  2016-06-15 13:07 ` Christoph Hellwig
  2016-07-08  1:03 ` NeilBrown
  0 siblings, 2 replies; 3+ messages in thread
From: Anton Altaparmakov @ 2016-06-13  1:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: NeilBrown, Jens Axboe, Theodore Ts'o, linux-fsdevel, Sougata Santra

Hi Christoph,

I think the reason the storage unplug crashes came back in the 4.1 kernel, after your work in the 4.0 kernel to fix them, is this commit: 6cd18e711dd8 "block: destroy bdi before blockdev is unregistered."

The fix was to basically violate the lifetime rules/reference counting you put in place and destroy the bdi before the reference count reaches zero which means we are back at square one!  The whole point of the reference count was specifically so that devices are not destroyed before the reference count becomes zero.  Or at least that was my understanding/assumption...

The solution should have perhaps been to fix MD and Loop drivers rather than to break the entire kernel all over again and then patch up ext4 again (commit bdfe0cbd746aa9b2509c2f6d6be17193cf7facd7).

The check in ext4 is not perfect because it is racy - if you unplug at the same time as the check is happening you can still get the kernel to crash.  I grant you it is a very small race window but it is there.
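
To make the race concrete, here is a minimal userspace sketch in C11. It is not the ext4 check from the commit above; struct fake_dev and all fake_dev_* names are hypothetical and exist only for illustration. It contrasts a plain "is it still alive?" test, whose answer can be stale by the time the caller acts on it, with taking a reference only if the count is still non-zero, which is roughly the idea behind kernel helpers such as atomic_inc_not_zero().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device handle; refs == 0 means teardown has begun. */
struct fake_dev {
    atomic_int refs;
};

/*
 * Racy pattern: the result may already be stale when the caller
 * acts on it, because nothing pins the device in between.
 */
bool fake_dev_looks_alive(struct fake_dev *d)
{
    return atomic_load(&d->refs) > 0;
}

/*
 * Check and pin in one atomic step: the reference is taken only
 * if the count was still non-zero at the moment of the exchange.
 */
bool fake_dev_get_unless_zero(struct fake_dev *d)
{
    int old = atomic_load(&d->refs);

    while (old > 0) {
        if (atomic_compare_exchange_weak(&d->refs, &old, old + 1))
            return true;
        /* A failed exchange reloaded 'old'; loop and retry. */
    }
    return false;
}

/* Drop one reference; dropping below zero is a caller bug. */
void fake_dev_put(struct fake_dev *d)
{
    int prev = atomic_fetch_sub(&d->refs, 1);

    assert(prev > 0);
}
```

A caller that gets true from fake_dev_get_unless_zero() holds its own reference until it calls fake_dev_put(), so a concurrent unplug cannot drive the count to zero underneath it. A caller that only consulted fake_dev_looks_alive() has no such guarantee.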

What do you think?

Best regards,

	Anton
-- 
Anton Altaparmakov <anton at tuxera.com> (replace at with @)
Lead in File System Development, Tuxera Inc., http://www.tuxera.com/
Linux NTFS maintainer



* Re: Device removal crash problems
  2016-06-13  1:26 Device removal crash problems Anton Altaparmakov
@ 2016-06-15 13:07 ` Christoph Hellwig
  2016-07-08  1:03 ` NeilBrown
  1 sibling, 0 replies; 3+ messages in thread
From: Christoph Hellwig @ 2016-06-15 13:07 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Christoph Hellwig, NeilBrown, Jens Axboe, Theodore Ts'o,
	linux-fsdevel, Sougata Santra

On Mon, Jun 13, 2016 at 01:26:24AM +0000, Anton Altaparmakov wrote:
> Hi Christoph,
> 
> I think the reason the storage unplug crashes came back in 4.1 kernel after your work in 4.0 kernel to fix them is this commit: 6cd18e711dd8 "block: destroy bdi before blockdev is unregistered."
> 
> The fix was to basically violate the lifetime rules/reference counting you put in place and destroy the bdi before the reference count reaches zero which means we are back at square one!  The whole point of the reference count was specifically so that devices are not destroyed before the reference count becomes zero.  Or at least that was my understanding/assumption...

Yes, that area is a mess.  Unfortunately I've been too busy to fight
it in detail - since then the writeback code also got trainwrecked by
the cgroups writeback support and it will take a long time for me to
get back into it.


* Re: Device removal crash problems
  2016-06-13  1:26 Device removal crash problems Anton Altaparmakov
  2016-06-15 13:07 ` Christoph Hellwig
@ 2016-07-08  1:03 ` NeilBrown
  1 sibling, 0 replies; 3+ messages in thread
From: NeilBrown @ 2016-07-08  1:03 UTC (permalink / raw)
  To: Anton Altaparmakov, Christoph Hellwig
  Cc: Jens Axboe, Theodore Ts'o, linux-fsdevel, Sougata Santra


On Mon, Jun 13 2016, Anton Altaparmakov wrote:

> Hi Christoph,
>
> I think the reason the storage unplug crashes came back in 4.1 kernel after your work in 4.0 kernel to fix them is this commit: 6cd18e711dd8 "block: destroy bdi before blockdev is unregistered."
>
> The fix was to basically violate the lifetime rules/reference counting you put in place and destroy the bdi before the reference count reaches zero which means we are back at square one!  The whole point of the reference count was specifically so that devices are not destroyed before the reference count becomes zero.  Or at least that was my understanding/assumption...
>
> The solution should have perhaps been to fix MD and Loop drivers rather than to break the entire kernel all over again and then patch up ext4 again (commit bdfe0cbd746aa9b2509c2f6d6be17193cf7facd7).
>
> The check in ext4 is not perfect because it is a race condition - if you unplug at same time as the check is happening you can still get the kernel to crash.  I grant you it is a very small race window but it is there.
>
> What do you think?

Is this problem fixed by

Commit: b02176f30cd3 ("block: don't release bdi while request_queue has live references")

(in 4.3-rc7)?

With that patch the unregistering is done early enough for md and loop,
but the freeing should be done late enough to not inconvenience
filesystems.
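
The split between "unregister early" and "free late" can be sketched in plain C as follows. This is a single-threaded illustration under stated assumptions, not the code from b02176f30cd3: struct fake_bdi, the fake_bdi_* functions and the fake_bdi_frees counter are all hypothetical, and real kernel code would need atomic reference counting and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bdi-like object; not the kernel struct. */
struct fake_bdi {
    int refs;         /* live references: owner plus users */
    bool registered;  /* still visible to new users? */
};

int fake_bdi_frees;   /* counts frees, for demonstration only */

struct fake_bdi *fake_bdi_alloc(void)
{
    struct fake_bdi *b = malloc(sizeof(*b));

    if (!b)
        return NULL;
    b->refs = 1;          /* the owner's (driver's) reference */
    b->registered = true;
    return b;
}

/* New users may attach only while the object is registered. */
bool fake_bdi_tryget(struct fake_bdi *b)
{
    if (!b->registered)
        return false;
    b->refs++;
    return true;
}

/* Early step: hide the object from new users; memory stays valid. */
void fake_bdi_unregister(struct fake_bdi *b)
{
    b->registered = false;
}

/* Late step: the memory is freed only when the last reference goes. */
void fake_bdi_put(struct fake_bdi *b)
{
    assert(b->refs > 0);
    if (--b->refs == 0) {
        fake_bdi_frees++;
        free(b);
    }
}
```

In this sketch a driver would call fake_bdi_unregister() and then fake_bdi_put() at removal time, while a filesystem that took a reference earlier can keep dereferencing the object until its own fake_bdi_put().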

Thanks,
NeilBrown

