After block device error, FICLONE and sync_file_range() make NULs, unlike read()

From: Noah Misch @ 2022-11-08 17:24 UTC
  To: linux-xfs

Scenario: due to a block device error, the kernel fails to persist some file
content.  Even so, read() always returns the file content accurately.  The
first FICLONE returns EIO, but every subsequent FICLONE or copy_file_range()
operates as though the file were all zeros.  How feasible is it to change FICLONE
and copy_file_range() such that they instead find the bytes that read() finds?

- Kernel is 6.0.0-1-sparc64-smp from Debian sid, running in a Solaris-hosted VM.

- The VM is gcc202 from https://cfarm.tetaneutral.net/machines/list/.
  Accounts are available.

- The outcome is still reproducible in FICLONE issued two days after the
  original block device error.  I haven't checked whether it survives a
  reboot.

- The "sync" command did not help.

- The block device errors have been ongoing for years.  If curious, see
  https://postgr.es/m/CA+hUKGKfrXnuyk0Z24m8x4_eziuC3kLSaCmEeKPO1DVU9t-qtQ@mail.gmail.com
  for details.  (Fixing the sunvdc driver is out of scope for this thread.)
  Other known symptoms are failures in truncate() and fsync().  The system has
  been generally usable for applications not requiring persistence.  I saw the
  FICLONE problem after the system updated coreutils from 8.32-4.1 to 9.1-1.
  That introduced a "cp" that uses FICLONE.  My current workaround is to place
  a "cp" in my PATH that does 'exec /usr/bin/cp --reflink=never "$@"'.
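
For anyone wanting to reproduce the workaround, here is a minimal sketch of
how such a wrapper could be installed.  The directory name "wrapper_dir" and
the use of mktemp are assumptions for illustration only; a persistent setup
would more likely use a fixed directory such as one already at the front of
PATH.  The wrapper body is the one-liner quoted above.

```shell
#!/bin/sh
# Sketch: create a "cp" wrapper and put its directory first in PATH.
# wrapper_dir is a hypothetical scratch location, not the original setup.
wrapper_dir=$(mktemp -d) || exit 1

# The wrapper forces cp to skip FICLONE by passing --reflink=never.
cat >"$wrapper_dir/cp" <<'EOF'
#!/bin/sh
exec /usr/bin/cp --reflink=never "$@"
EOF
chmod +x "$wrapper_dir/cp"

# Directories earlier in PATH win, so plain "cp" now resolves to the wrapper.
PATH=$wrapper_dir:$PATH
export PATH
```

Anything that calls /usr/bin/cp by absolute path bypasses the wrapper, so this
only covers callers that look up "cp" through PATH.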

The trouble emerged at a "cp".  To capture more details, I replaced "cp" with
"trace-cp" containing:

  sum "$1"
  strace cp "$@" 2>&1 | sed -n '/^geteuid/,$p'
  sum "$2"

Output from that follows.  FICLONE returns EIO.  "cp" then falls back to
copy_file_range(), which yields an all-zeros file:

  47831 16384 pg_wal/000000030000000000000003
  geteuid()                               = 1450
  openat(AT_FDCWD, "/home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003", O_RDONLY|O_PATH|O_DIRECTORY) = -1 ENOENT (No such file or directory)
  fstatat64(AT_FDCWD, "pg_wal/000000030000000000000003", {st_mode=S_IFREG|0600, st_size=16777216, ...}, 0) = 0
  openat(AT_FDCWD, "pg_wal/000000030000000000000003", O_RDONLY) = 4
  fstatat64(4, "", {st_mode=S_IFREG|0600, st_size=16777216, ...}, AT_EMPTY_PATH) = 0
  openat(AT_FDCWD, "/home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003", O_WRONLY|O_CREAT|O_EXCL, 0600) = 5
  ioctl(5, BTRFS_IOC_CLONE or FICLONE, 4) = -1 EIO (Input/output error)
  fstatat64(5, "", {st_mode=S_IFREG|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0
  fadvise64_64(4, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
  copy_file_range(4, NULL, 5, NULL, 9223372035781033984, 0) = 16777216
  copy_file_range(4, NULL, 5, NULL, 9223372035781033984, 0) = 0
  close(5)                                = 0
  close(4)                                = 0
  _llseek(0, 0, [0], SEEK_CUR)            = 0
  close(0)                                = 0
  close(1)                                = 0
  close(2)                                = 0
  exit_group(0)                           = ?
  +++ exited with 0 +++
  00000 16384 /home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003
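
The "00000" checksum above is what "sum" prints for a file consisting only of
NUL bytes.  A more direct check is sketched below; the function name
"is_all_zeros" is hypothetical, and the sketch assumes a cmp that supports
"-n" (GNU diffutils does).

```shell
#!/bin/sh
# Sketch: succeed if the named file is non-empty and contains only NUL bytes.
# is_all_zeros is a hypothetical helper, not part of any existing tool.
is_all_zeros() {
    size=$(wc -c <"$1") || return 2
    size=$((size))                    # normalize any padding in wc output
    [ "$size" -gt 0 ] || return 1     # treat an empty file as "not all zeros"
    # Compare exactly $size bytes of the file against /dev/zero.
    cmp -s -n "$size" "$1" /dev/zero
}
```

Usage would be along the lines of 'is_all_zeros "$dest" && echo "all NULs"'.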

Subsequent FICLONE returns 0 and yields an all-zeros file.  Test script:

  set -x
  broken_source=t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  dest=$HOME/tmp/discard
  sum "$broken_source"
  : 'FICLONE returns 0 and yields an all-zeros file'
  strace cp --reflink=always "$broken_source" "$dest" 2>&1 | sed -n '/^geteuid/,$p'
  sum "$dest"; rm "$dest"
  : 'copy_file_range() returns 0 and yields an all-zeros file'
  strace -e copy_file_range cat "$broken_source" >"$dest"
  sum "$dest"; rm "$dest"
  : 'read() gets the intended bytes'
  cat "$broken_source" | cat >"$dest"
  sum "$dest"; rm "$dest"

Test script output:

  + broken_source=t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  + dest=/home/nm/tmp/discard
  + sum t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  49522 16384 t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  + : FICLONE returns 0 and yields an all-zeros file
  + strace cp --reflink=always t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003 /home/nm/tmp/discard
  + sed -n /^geteuid/,$p
  geteuid()                               = 1450
  openat(AT_FDCWD, "/home/nm/tmp/discard", O_RDONLY|O_PATH|O_DIRECTORY) = -1 ENOENT (No such file or directory)
  fstatat64(AT_FDCWD, "t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003", {st_mode=S_IFREG|0600, st_size=16777216, ...}, 0) = 0
  openat(AT_FDCWD, "t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003", O_RDONLY) = 3
  fstatat64(3, "", {st_mode=S_IFREG|0600, st_size=16777216, ...}, AT_EMPTY_PATH) = 0
  openat(AT_FDCWD, "/home/nm/tmp/discard", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
  ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = 0
  close(4)                                = 0
  close(3)                                = 0
  _llseek(0, 0, 0x7feffddf1c0, SEEK_CUR)  = -1 ESPIPE (Illegal seek)
  close(0)                                = 0
  close(1)                                = 0
  close(2)                                = 0
  exit_group(0)                           = ?
  +++ exited with 0 +++
  + sum /home/nm/tmp/discard
  00000 16384 /home/nm/tmp/discard
  + rm /home/nm/tmp/discard
  + : copy_file_range() returns 0 and yields an all-zeros file
  + strace -e copy_file_range cat t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 16777216
  copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 0
  +++ exited with 0 +++
  + sum /home/nm/tmp/discard
  00000 16384 /home/nm/tmp/discard
  + rm /home/nm/tmp/discard
  + : read() gets the intended bytes
  + cat t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  + cat
  + sum /home/nm/tmp/discard
  49522 16384 /home/nm/tmp/discard
  + rm /home/nm/tmp/discard
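
The comparison that the test script makes by eye could be automated roughly as
follows.  This is only a sketch: the function name "copies_match" and its
second argument (a scratch directory, ideally on the same filesystem as the
source so that FICLONE is attempted) are assumptions, and it relies on GNU
cp's --reflink=auto.  It takes one copy through a pipe, which forces read(),
and one copy through cp, which may use FICLONE or copy_file_range(), then
compares the two.

```shell
#!/bin/sh
# Sketch: exit 0 if a read()-based copy and a cp-based copy of $1 are
# byte-identical, 1 if they differ, 2 on setup or copy failure.
# $2 is a hypothetical scratch directory chosen by the caller.
copies_match() (
    src=$1
    scratch=$(mktemp -d "$2/copies_match.XXXXXX") || exit 2

    # Reading through a pipe keeps cat from using copy_file_range().
    cat "$src" | cat >"$scratch/via_read" &&
        # cp may clone or use copy_file_range() here.
        cp --reflink=auto "$src" "$scratch/via_cp"
    if [ $? -ne 0 ]; then
        rm -rf "$scratch"
        exit 2
    fi

    cmp -s "$scratch/via_read" "$scratch/via_cp"
    rc=$?
    rm -rf "$scratch"
    exit "$rc"
)
```

On a healthy file, 'copies_match "$broken_source" "$HOME/tmp"' should exit 0;
on the affected file described above I would expect exit status 1, since cp
yields NULs while read() yields the intended bytes.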
