* Reducing ext4 fs issues resulting from frequent hard poweroffs
From: Julio Lajara @ 2020-05-12 21:08 UTC
  To: linux-ext4

Hi all, I currently manage an IoT fleet based on Intel NUCs running
Ubuntu 18.04 Server on SSDs with ext4, no swap. The device usage is
more CPU bound than I/O bound and we are having some issues keeping a
subset of devices running due to them being hard powered off in the
field in some regions (sometimes as frequently as every 12 hours). Due
to current difficulties in getting devices back from the field I'm
looking into tweaking them as best as possible to survive these hard
power-offs, barring any physical SSD issues.

Currently I have tried tweaking some ext4 and I/O settings with the following:

* kernel options:
  elevator=noop fsck.mode=force fsck.repair=yes

* fstab ext4 specific mount options:
  commit=1,max_batch_time=0

Are there any other configuration settings or changes to the above
that would make sense to try here for this use case? I am hoping to at
least make the fsck repair the last line of defence so it doesn't get
stuck waiting for a prompt to repair it at boot, but want to try to
change the I/O / ext4 behavior if possible so it's writing as
frequently as sanely possible, to try to reduce how often fsck is
actually needed.
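
To confirm that the settings listed above are actually in effect on a
running device, something like the following could be used. This is a
minimal sketch, assuming Python 3 on Linux and assuming that the kernel
reports non-default ext4 options such as commit= and max_batch_time= in
/proc/mounts. The function names are hypothetical and only for
illustration.

```python
def parse_cmdline(text):
    """Return kernel parameters as a dict; bare flags map to None."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def mount_options(mounts_text, mount_point):
    """Return the option set for mount_point from /proc/mounts-style text."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last matching entry: later mounts shadow earlier ones.
            options = set(fields[3].split(","))
    return options


def missing_settings(cmdline_text, mounts_text, mount_point="/"):
    """List the settings from this message that are not in effect."""
    wanted_params = {
        "elevator": "noop",
        "fsck.mode": "force",
        "fsck.repair": "yes",
    }
    wanted_opts = {"commit=1", "max_batch_time=0"}

    params = parse_cmdline(cmdline_text)
    opts = mount_options(mounts_text, mount_point) or set()

    missing = [f"{k}={v}" for k, v in wanted_params.items()
               if params.get(k) != v]
    missing += sorted(wanted_opts - opts)
    return missing
```

On a device, the two text arguments would be the contents of
/proc/cmdline and /proc/mounts; an empty list means every setting was
found.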

Thanks,


* Re: Reducing ext4 fs issues resulting from frequent hard poweroffs
From: Theodore Y. Ts'o @ 2020-05-12 22:01 UTC
  To: julio.lajara; +Cc: linux-ext4

On Tue, May 12, 2020 at 05:08:51PM -0400, Julio Lajara wrote:
> Hi all, I currently manage an IoT fleet based on Intel NUCs running
> Ubuntu 18.04 Server on SSDs with ext4, no swap. The device usage is
> more CPU bound than I/O bound and we are having some issues keeping a
> subset of devices running due to them being hard powered off in the
> field in some regions (sometimes as frequently as every 12 hours). Due
> to current difficulties in getting devices back from the field I'm
> looking into tweaking them as best as possible to survive these hard
> power-offs, barring any physical SSD issues.

Hi Julio,

If the hardware devices are behaving appropriately --- that is, after
receiving a CACHE FLUSH command the storage device persists all blocks
written up to the CACHE FLUSH command, such that when the OS receives
the command completion notification of the CACHE FLUSH, everything is
persisted even after a hard power off --- no special configuration
should be necessary.
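
For application data, the same guarantee is what makes the usual
"write, fsync, rename" pattern safe. The following is a minimal sketch,
assuming Python 3 on Linux and a default ext4 mount where fsync() is
expected to result in a cache flush to the device. The helper name
write_durably is hypothetical, and none of this helps if the SSD
acknowledges the flush without actually persisting the data.

```python
import os
import tempfile


def write_durably(path, data):
    """Replace `path` with `data` (bytes) so that a power cut leaves
    either the old or the new contents, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))

    # Write to a temporary file in the same directory, so the final
    # rename stays within one filesystem. mkstemp creates it mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to persist the file
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

    # Persist the directory entry created by the rename as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A call such as write_durably("/var/lib/myapp/state.json", payload)
(path hypothetical) would then only return once the new contents have
been handed to the device with a flush request.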

We have regression tests which simulate this and ext4 regularly passes
them.

If you need to tweak settings, that's an indication that your hardware
is buggy.  And unfortunately, there's not much we can do to prevent
failures.  A lot is going to depend on *how* crappy the SSDs happen
to be.

Your best bet might be to find a way to make your root filesystem
read-only, so it's not being modified at all, and then set up a
scratch partition with state which can be reformatted at any time if
it gets corrupted --- and then try to get all of your data pushed out
to your remote servers / cloud as often as possible.  And next time,
qualify the SSDs ahead of time to make sure they aren't overly "cost
optimized" (read: crap) before you buy your fleet of devices.  :-(

	   	  	       	       - Ted


* Re: Reducing ext4 fs issues resulting from frequent hard poweroffs
From: Eric Sandeen @ 2020-05-13  3:16 UTC
  To: julio.lajara, linux-ext4

On 5/12/20 4:08 PM, Julio Lajara wrote:
> Hi all, I currently manage an IoT fleet based on Intel NUCs running
> Ubuntu 18.04 Server on SSDs with ext4, no swap. The device usage is
> more CPU bound than I/O bound and we are having some issues keeping a
> subset of devices running due to them being hard powered off in the
> field in some regions (sometimes as frequently as every 12 hours). Due
> to current difficulties in getting devices back from the field I'm
> looking into tweaking them as best as possible to survive these hard
> power-offs, barring any physical SSD issues.

I don't think you've actually said what the failure mode after power
loss is, have you?

> Currently I have tried tweaking some ext4 and I/O settings with the following:
> 
> * kernel options:
>   elevator=noop fsck.mode=force fsck.repair=yes
> 
> * fstab ext4 specific mount options:
>   commit=1,max_batch_time=0
> 
> Are there any other configuration settings or changes to the above
> that would make sense to try here for this use case? I am hoping to at
> least make the fsck repair the last line of defence so it doesn't get
> stuck waiting for a prompt to repair it at boot, but want to try to
> change the I/O / ext4 behavior if possible so it's writing as
> frequently as sanely possible, to try to reduce how often fsck is
> actually needed.

I can't tell from this why fsck is needed in the first place; what
actually goes wrong when power is lost?  Ted's right that properly
behaving hardware should not require any special attention after
power loss to restore filesystem consistency, but I can't tell for
sure what your actual root cause for boot failure is from this
email...
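
One piece of information that might help narrow this down is what the
superblock records after a failed boot. The following is a minimal
sketch, assuming Python 3 and assuming the "Key: value" header layout
printed by `dumpe2fs -h`; the function names are hypothetical, and the
kernel log from the failing boot would still be needed alongside it.

```python
def parse_superblock_header(text):
    """Turn 'Key:   value' lines (as printed by dumpe2fs -h) into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, since values such as
        # timestamps contain colons themselves.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def summarize(fields):
    """Pick out the fields most relevant to a power-loss report."""
    keys = ("Filesystem state", "Mount count",
            "Last mount time", "Last checked")
    return {k: fields[k] for k in keys if k in fields}
```

The input would be the captured output of `dumpe2fs -h` on the affected
device, run as root from a rescue environment or before the filesystem
is repaired.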

-Eric
