From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Dave T <davestechshop@gmail.com>
Cc: Qu Wenruo <wqu@suse.com>, Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: bad file extent, some csum missing - how to check that restored volumes are error-free?
Date: Sat, 17 Jul 2021 07:06:33 +0800	[thread overview]
Message-ID: <e42fcd8e-23d4-ee98-aab6-2210e408ad3f@gmx.com> (raw)
In-Reply-To: <CAGdWbB59w+5=3AoKU0uRHHkA1zeya0cRhqRn8sDYpea+hZOunA@mail.gmail.com>



On 2021/7/16 11:40 PM, Dave T wrote:
> On Fri, Jul 16, 2021 at 9:28 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>> I can do more testing and let you know. Can you suggest any tests you
>>> would like me to try?
>>
>> 1. Try to read the affected file:
>>
>> - Mount the btrfs
>>
>> - Read inode 262 in root 329 (just one example)
>>     You can use "find -inum 262" inside root 329 to locate the file.
>>
>
> I have reconnected and mounted the affected SSD.
> Most of the csum errors reported are for root 334 like this:
>
> root 334 inode 184115 errors 1040, bad file extent, some csum missing
> root 334 inode 184116 errors 1040, bad file extent, some csum missing
> There are hundreds of similar error lines.
>
> There were only a few for root 329 and one for 330.
>
> What is the method to map root 334, for example, to a file system
> path? Is it like this?
>
> # btrfs su li /
> ID 257 gen 1106448 top level 5 path @root
> ...
> ID 329 gen 1105905 top level 326 path @home/live/snapshot/user1/.cache
> ID 330 gen 1105905 top level 326 path @home/live/snapshot/user2/.cache
> ID 331 gen 1105905 top level 326 path @home/live/snapshot/user3/.cache
> ID 332 gen 1105905 top level 326 path @home/live/snapshot/user4/.cache
> ID 333 gen 1105905 top level 326 path @home/live/snapshot/user5/.cache
> ID 334 gen 1105905 top level 326 path @home/live/snapshot/user6/.cache
>
> # cd /home/user6/.cache
> # find . -inum 184116
> ./mozilla/firefox/profile1/cache2/entries/3E5DF2A295E7D36F537DFDC221EBD6153F46DC30
>
> Did I do that correctly?

Yes, you're doing it correctly.
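
For reference, here is the whole mapping in one place (the subvolume ID,
path and inode number are taken from your listing above; adjust the
paths to your own mount points). Inode numbers are only unique within a
single subvolume, so find has to run inside the right one; -xdev keeps
it from crossing into nested subvolumes:

# btrfs subvolume list / | grep 'ID 334 '
ID 334 gen 1105905 top level 326 path @home/live/snapshot/user6/.cache
# find /home/user6/.cache -xdev -inum 184116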

>
> # less ./mozilla/firefox/profile1/cache2/entries/3E5DF2A295E7D36F537DFDC221EBD6153F46DC30
> "./mozilla/firefox/profile1/cache2/entries/3E5DF2A295E7D36F537DFDC221EBD6153F46DC30"
> may be a binary file.  See it anyway?
>
> I viewed it and there are no errors in terminal or systemd journal
> when reading it.
>
> Next I tested every reported inode in root 334 (assuming I identified
> the root correctly) using this method:
>
> find /home/user6/.cache/ -inum 184874 -exec bash -c 'cp "{}" /tmp ;
> out=$(basename "{}"); rm /tmp/$out' \;
>
> I got a list of every inode number (e.g., 184874) from the output of
> my prior checks and looped through them all. No errors were reported.
>
> I do not see any related errors in dmesg either.


That's the expected behavior.

So the original failure to read was a separate problem.
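
If you want to re-run that test later, a compact version of the same
loop (the inode numbers are just examples from your report); reading
each file end to end will surface any EIO in the terminal and in dmesg:

# for i in 184115 184116 184874; do \
    find /home/user6/.cache -xdev -inum $i -exec cat {} + >/dev/null; done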
>
> # dmesg | grep -i btrfs
> [  +0.032192] Btrfs loaded, crc32c=crc32c-intel, zoned=yes
> [  +0.000546] BTRFS: device label top_level devid 1 transid 1106559
> /dev/dm-0 scanned by systemd-udevd (120)
> [  +0.029879] BTRFS info (device dm-0): disk space caching is enabled
> [  +0.000003] BTRFS info (device dm-0): has skinny extents
> [  +0.096620] BTRFS info (device dm-0): enabling ssd optimizations
> [  +0.002567] BTRFS info (device dm-0): enabling auto defrag
> [  +0.000005] BTRFS info (device dm-0): use lzo compression, level 0
> [  +0.000005] BTRFS info (device dm-0): disk space caching is enabled
> [  +0.044004] BTRFS info (device dm-0): devid 1 device path
> /dev/mapper/root changed to /dev/dm-0 scanned by systemd-udevd (275)
> [  +0.000829] BTRFS info (device dm-0): devid 1 device path /dev/dm-0
> changed to /dev/mapper/root scanned by system
>
> The only other FS-related messages in dmesg are:
>
> [  +0.142425] FS-Cache: Netfs 'nfs' registered for caching
> [  +0.018228] Key type dns_resolver registered
> [  +0.194893] NFS: Registering the id_resolver key type
> [  +0.000016] Key type id_resolver registered
> [  +0.000001] Key type id_legacy registered
> [  +0.022450] FS-Cache: Duplicate cookie detected
>
> If I have done that correctly, it raises some interesting questions.
> First, I started using a btrfs subvolume for user .cache directories
> in late 2018. I do this:
>
> users_list="user1 user2 user3 ... userN"
> for uu in $users_list; do
>     btrfs su cr $destination/@home/live/snapshot/${uu}/.cache
>     chattr +C $destination/@home/live/snapshot/${uu}/.cache
>     chown ${uu}:${uu} $destination/@home/live/snapshot/${uu}/.cache
> done
>
> The reason is to exclude the cache contents from snapshots & backups.
>
> The user6 user has apparently not logged into this particular device
> since May 15, 2019. (It is now used primarily by someone else.) The
> files in /home/user6/.cache all appear to have dates on or before
> May 15, 2019, and none older than Feb 3, 2019. The vast majority of the
> reported errors were in these files. However, I do not see errors when
> accessing those files now.

So far so good; everything is working as expected.

It's just that btrfs check is a little paranoid.

BTW, apart from the bad file extent and missing csum errors, are there
any other errors reported by btrfs check?
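
If you re-run it, capturing the full output makes anything beyond the
csum warnings easy to spot (btrfs check must run against the unmounted
device; the device path here is taken from your dmesg output):

# btrfs check --readonly /dev/mapper/root 2>&1 | tee /tmp/btrfs-check.log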

>
>> - Find a way to reproduce the read-only fs problem
>
> This happened when I was using btrbk to send|receive snapshots to a
> target via ssh. I do not think it is a coincidence that I was doing a
> btrfs operation at the time this happened.
>
> I did the same btrbk operation on another device (a ThinkPad T450
> laptop) that has been running Arch Linux and BTRFS for many years
> (probably since around 2015). There, the btrbk operation succeeded
> with no errors.
>
> Here is exactly what I did when the read-only problem first happened:
>
> # btrbk dryrun
> --------------------------------------------------------------------------------
> Backup Summary (btrbk command line client, version 0.31.2)
>
>      Date:   Tue Jul 13 23:11:32 2021
>      Config: /etc/btrbk/btrbk.conf
>      Dryrun: YES
>
> Legend:
>      ===  up-to-date subvolume (source snapshot)
>      +++  created subvolume (source snapshot)
>      ---  deleted subvolume
>      ***  received subvolume (non-incremental)
>      >>>  received subvolume (incremental)
> --------------------------------------------------------------------------------
> /mnt/top_level/@root/live/snapshot
> +++ /mnt/top_level/@root/_btrbk_snap/root.20210713T231132-0400
> *** backupsrv:/backup/clnt/laptop2/@root/root.20210713T231132-0400
>
> /mnt/top_level/@home/live/snapshot
> +++ /mnt/top_level/@home/_btrbk_snap/home.20210713T231132-0400
> *** backupsrv:/backup/clnt/laptop2/@home/home.20210713T231132-0400
>
> /mnt/top_level/@logs/live/snapshot
> +++ /mnt/top_level/@logs/_btrbk_snap/vlog.20210713T231132-0400
> *** backupsrv:/backup/clnt/laptop2/@log/vlog.20210713T231132-0400
>
> NOTE: Dryrun was active, none of the operations above were actually executed!
>
> # systemctl disable --now snapper-timeline.timer
>
> # systemctl enable --now btrbk.timer
> Created symlink /etc/systemd/system/timers.target.wants/btrbk.timer →
> /usr/lib/systemd/system/btrbk.timer.
>
> # systemctl list-timers --all
> NEXT                        LEFT        LAST                        PASSED        UNIT                         ACTIVATES
> Wed 2021-07-14 00:00:00 EDT 47min left  n/a                         n/a           btrbk.timer                  btrbk.service
> Wed 2021-07-14 00:00:00 EDT 47min left  Tue 2021-07-13 09:05:48 EDT 14h ago       logrotate.timer              logrotate.service
> Wed 2021-07-14 00:00:00 EDT 47min left  Tue 2021-07-13 09:05:48 EDT 14h ago       man-db.timer                 man-db.service
> Wed 2021-07-14 00:00:00 EDT 47min left  Tue 2021-07-13 09:05:48 EDT 14h ago       shadow.timer                 shadow.service
> Wed 2021-07-14 17:31:57 EDT 18h left    Tue 2021-07-13 17:31:57 EDT 5h 40min ago  snapper-cleanup.timer        snapper-cleanup.service
> Wed 2021-07-14 17:36:17 EDT 18h left    Tue 2021-07-13 17:36:17 EDT 5h 36min ago  systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
> Mon 2021-07-19 01:11:06 EDT 5 days left Mon 2021-07-12 01:24:15 EDT 1 day 21h ago fstrim.timer                 fstrim.service
>
> 7 timers listed.
>
> # systemctl start btrbk.service
>
> # systemctl status btrbk
> ○ btrbk.service - btrbk backup
>       Loaded: loaded (/usr/lib/systemd/system/btrbk.service; static)
>      Drop-In: /etc/systemd/system/btrbk.service.d
>               └─override.conf
>       Active: inactive (dead) since Tue 2021-07-13 23:17:54 EDT; 20s ago
> TriggeredBy: ● btrbk.timer
>         Docs: man:btrbk(1)
>      Process: 6827 ExecStart=/usr/local/bin/btrbk_run.sh (code=exited,
> status=0/SUCCESS)
>     Main PID: 6827 (code=exited, status=0/SUCCESS)
>          CPU: 2min 40.794s
>
> # mount /mnt/top_level/
> mount: /mnt/top_level: wrong fs type, bad option, bad superblock on
> /dev/mapper/root, missing codepage or helper program, or other error.
>
> # ls -la /mnt/top_level/
> total 0
> drwxr-x--x 1 root root   0 Nov  1  2017 .
> drwxr-xr-x 1 root root 116 Apr 10  2020 ..
>
> My prompt includes a timestamp like this:
>
>   !2813 [13-Jul 23:19:18] root@laptop2
> # journalctl -r
> An error was encountered while opening journal file or directory
> /var/log/journal/7db5321aaf884af786868ec2f2e9c7b0/system.journal,
> ignoring file: Input/output error
> -- Journal begins at Thu 2021-06-17 15:14:31 EDT, ends at Tue
> 2021-07-13 16:19:12 EDT. --
> Jul 13 16:19:12 laptop2 sudo[674]: pam_unix(sudo:session): session
> opened for user root(uid=0) by user0(uid=1000)
>
> As far as I can tell, the last 7 hours of the journal are missing at that point.
>
> That's exactly how the read-only problem happened. I did a btrbk
> dryrun to validate the configuration. Then I started the backup. Near
> (or at) the end of the backup for the root subvolume, the backup
> process exited, but I could not see the journal entries for it because
> they were missing and the filesystem was read-only.

It's a pity that we didn't capture the dmesg of that RO event; it would
contain the most valuable info.
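
If it happens again, streaming the kernel log somewhere that survives
the root fs flipping read-only would catch it. For example (just a
sketch; /dev/shm is tmpfs, so copy the log off before rebooting):

# dmesg --follow > /dev/shm/dmesg-during-btrbk.log &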

But at least so far your old fs looks fine; you can continue using it.
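
If you want extra confidence, a read-only scrub will read and verify
every checksummed block while the fs stays mounted (any mount point of
the fs works as the argument):

# btrfs scrub start -Bd -r /

Note that files created under chattr +C are nodatacow and carry no data
checksums, so scrub cannot verify their contents.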

Thanks,
Qu

>
