From: Clay Gerrard <clay.gerrard@gmail.com>
To: linux-xfs@vger.kernel.org
Subject: ENODATA on list/stat directory
Date: Thu, 23 Jun 2022 14:52:22 -0500
Message-ID: <CA+_JKzo7V5PZkWGFPB5hP0pAtWrOsi0TomxHaO5W+ViEF8ctwQ@mail.gmail.com>

I work on an object storage system, OpenStack Swift, which has always
used XFS on its storage nodes.  Over the years our systems have
encountered all sorts of disk failures and occasional apparent
filesystem corruption, but lately we've been noticing something that
might be "new" and I'm considering how to approach the problem.  I'd
like to solicit critique of my current thinking/process - particularly
from XFS experts.

[root@s8k-sjc3-d01-obj-9 ~]# xfs_bmap
/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53
/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53:
No data available
[root@s8k-sjc3-d01-obj-9 ~]# xfs_db
/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53
/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53:
No data available

fatal error -- couldn't initialize XFS library
[root@s8k-sjc3-d01-obj-9 ~]# ls -alhF /srv/node/d21865/quarantined/objects-1/e53
ls: cannot access
/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53:
No data available
total 4.0K
drwxr-xr-x  9 swift swift  318 Jun  7 00:57 ./
drwxr-xr-x 33 swift swift 4.0K Jun 23 16:10 ../
d?????????  ? ?     ?        ?            ? f0418758de4baaa402eb301c5bae3e53/
drwxr-xr-x  2 swift swift   47 May 27 00:43 f04193c31edc9593007471ee5a189e53/
drwxr-xr-x  2 swift swift   47 May 27 00:43 f0419c711a5a5d01dac6154970525e53/
drwxr-xr-x  2 swift swift   47 May 27 00:43 f041a2548b9255493d16ba21c19b6e53/
drwxr-xr-x  2 swift swift   47 Jun  7 00:57 f041aa09d40566d6915a706a22886e53/
drwxr-xr-x  2 swift swift   39 May 27 00:43 f041ac88bf13e5458a049d827e761e53/
drwxr-xr-x  2 swift swift   47 May 27 00:43 f041bfd1c234d44b591c025d459a7e53/
[root@s8k-sjc3-d01-obj-9 ~]# python
Python 2.7.5 (default, Nov 16 2020, 22:23:17)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.stat('/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 61] No data available:
'/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53'
>>> os.listdir('/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 61] No data available:
'/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53'
>>>

[root@s8k-sjc3-d01-obj-9 ~]# uname -a
Linux s8k-sjc3-d01-obj-9.nsv.sjc3.nvmetal.net
3.10.0-1160.62.1.el7.x86_64 #1 SMP Tue Apr 5 16:57:59 UTC 2022 x86_64
x86_64 x86_64 GNU/Linux
[root@s8k-sjc3-d01-obj-9 ~]# mount | grep /srv/node/d21865
/dev/sdd on /srv/node/d21865 type xfs
(rw,noatime,nodiratime,attr2,inode64,logbufs=8,noquota)
[root@s8k-sjc3-d01-obj-9 ~]# xfs_db -r /dev/sdd
xfs_db> version
versionnum [0xbcb5+0x18a] =
V5,NLINK,DIRV2,ATTR,ALIGN,LOGV2,EXTFLG,SECTOR,MOREBITS,ATTR2,LAZYSBCOUNT,PROJID32BIT,CRC,FTYPE
xfs_db>

We can't "do" anything with the directory once it starts giving us
ENODATA.  We don't typically like to unmount the whole filesystem
(there's a LOT of *uncorrupt* data on the device) - so I'm not 100%
sure if xfs_repair fixes these directories.  Swift itself is a
replicated/erasure-coded store - we can almost always "throw away"
corrupt data on a single node and the rest can bring the cluster state
back to full durability.

This particular failure is worrisome for two reasons:

1) we can't "just" delete the affected directory - because we can't
stat or rename it - so we have to throw away the whole *parent*
directory (a HUGE blast radius in some cases; see the sketch after
this list)
2) in 10 years of running Swift I've never seen exactly this, and now
it seems to be happening more and more often - but we don't know
whether the trigger is a new software version, a new hardware
revision, or a new access pattern
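
To make reason 1 concrete, here's a minimal illustrative sketch - this
is hypothetical code, not our actual Swift auditor, and the function
and path names are made up.  Because the broken entry can't be stat'd
or renamed, the smallest unit that can be moved aside is its parent
directory:

import errno
import os

def audit_suffix_dir(suffix_dir, quarantine_dir):
    # quarantine_dir is assumed to live on the same filesystem
    # (e.g. /srv/node/<dev>/quarantined/...), so os.rename() works.
    for name in os.listdir(suffix_dir):
        path = os.path.join(suffix_dir, name)
        try:
            os.stat(path)
        except OSError as e:
            if e.errno != errno.ENODATA:
                raise
            # The entry itself can't be stat'd or renamed, so the
            # smallest thing we can quarantine is the whole parent
            # directory - hence the large blast radius.
            os.rename(suffix_dir,
                      os.path.join(quarantine_dir,
                                   os.path.basename(suffix_dir)))
            return
        # ... normal per-object auditing would continue here ...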

I'd also like to be able to "simulate" this kind of corruption on a
healthy filesystem so we can test our "quarantine/auditor" code that's
trying to move these filesystem problems out of the way for the
consistency engine.  Does anyone have any guess how I could MAKE an
xfs filesystem produce this kind of behavior on purpose?
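
The sort of reproducer I have in mind looks roughly like the sketch
below - strictly a sketch, with a placeholder scratch device, and no
certainty that this particular damage is what produces ENODATA rather
than some other corruption error:

#!/usr/bin/env python
# Rough sketch only.  /dev/sdX and /mnt/scratch are placeholders for a
# throwaway scratch device - this destroys whatever is on it.  Whether
# clobbering core.format (or any particular field) maps to ENODATA on
# a given kernel is a guess; it's just one candidate field xfs_db can
# overwrite.  On CRC filesystems xfs_db may need "write -d" to get
# past the write verifiers.
import os
import subprocess

DEV = '/dev/sdX'        # placeholder scratch device
MNT = '/mnt/scratch'    # placeholder mountpoint

subprocess.check_call(['mkfs.xfs', '-f', DEV])
subprocess.check_call(['mount', DEV, MNT])

victim = os.path.join(MNT, 'parent', 'victim')
os.makedirs(victim)
ino = os.stat(victim).st_ino

subprocess.check_call(['umount', MNT])

# Overwrite an on-disk field of the victim directory's inode with
# xfs_db in expert (-x) mode.
subprocess.check_call(['xfs_db', '-x',
                       '-c', 'inode %d' % ino,
                       '-c', 'write core.format 0',
                       DEV])

subprocess.check_call(['mount', DEV, MNT])
try:
    os.stat(victim)
except OSError as e:
    print(e)    # hoping for: [Errno 61] No data available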

--
Clay Gerrard


Thread overview: 3+ messages
2022-06-23 19:52 Clay Gerrard [this message]
2022-06-23 22:13 ` ENODATA on list/stat directory Dave Chinner
2022-06-24  1:37   ` Dave Chinner
