From: Filipe Manana <fdmanana@gmail.com>
To: Philipp Fent <fent@in.tum.de>
Cc: Wang Yugui <wangyugui@e16-tech.com>,
linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Leaf corruption due to csum range
Date: Tue, 11 May 2021 13:35:31 +0100
Message-ID: <CAL3q7H6WmvatgNpGA6pqPBfe6TjPViwwCJo=wrjBOZRN0q0LuQ@mail.gmail.com>
In-Reply-To: <ad414944-2418-3728-ac1a-5d4d37e37ac1@in.tum.de>
On Tue, May 11, 2021 at 12:35 PM Philipp Fent <fent@in.tum.de> wrote:
>
> On 11.05.21 10:18, Wang Yugui wrote:
> > Is this a server with ECC memory?
>
> My machine does not have ECC memory. So I guess a random bitflip is a
> possibility, but I would rule this out. Memtester didn't report any
> errors in an extended test, and I run quite a lot of write heavy
> databases like sql server. This sql server workload is the only
> application where I ever got csum errors. And I really get them *every
> time* I run it. I think it is more likely an issue in sql server's write
> pattern.
>
>
> On 11.05.21 10:56, Filipe Manana wrote:
> > Most likely it's a race when adding checksums. In this case for the
> > log tree (fsync).
> > Try to see if there are reflink operations (clone and dedupe) done by
> > sql server (or maybe docker), in case there aren't, that excludes
> > shared extents being the cause of the problem.
>
> I now also ran the sql server docker container under sysdig, which gave
> me the following breakdown of executed system calls of the container:
> CALLS/S  TOT TIME  AVG TIME  SYSCALL
>   13.05   702.41s     420ms  futex
>   12.11     193us     124ns  gettid
>    7.55    1.05ms    1.09us  mprotect
>    6.55      70ms      84us  nanosleep
>    5.70     128us     175ns  getcpu
>    5.44    1.04ms    1.49us  read
>    5.41    1.18ms    1.71us  mmap
>    5.25    1.99ms    2.96us  openat
>    4.44     122us     215ns  close
>    4.23     308us     568ns  fstat
>    1.50     298us    1.55us  access
>    1.11      21us     153ns  getpid
>    1.08      38us     279ns  rt_sigprocmask
>    0.94      41us     343ns  lseek
>    0.91     542us    4.67us  munmap
>    0.77      25us     260ns  sigaltstack
>    0.77      22us     226ns  set_robust_list
>    0.70     616us    6.92us  clone
>    0.55     653us    9.33us  sendto
>    0.53    88.96s     1.30s  recvfrom
>    0.44     867us      15us  madvise
>    0.44      11us     198ns  rt_sigaction
>
> Filipe's suspicion of a race condition might be a good lead. There are
> several clone operations in the trace and the majority of runtime seems
I'm not familiar with sysdig myself, but those clone calls are most
likely the system call for creating a process [1], and not the ioctls
for cloning [2] or dedupe [3].
strace output, which I'm more familiar with, would be clearer to me (or
even better, bpftrace).
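To make that distinction concrete, here is a minimal sketch (not taken
from the report) of how strace-style lines could be sorted into process
creation versus reflink ioctls. It assumes strace prints the ioctl
request by name (for example FICLONERANGE or FIDEDUPERANGE) and that
lines carry no timestamp prefix; the helper names classify and summarize
are hypothetical.

```python
import re
from collections import Counter

# Optional "[pid N] " or "N " prefix (as printed with strace -f), then the syscall name.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(")

# Reflink request names, assuming strace decodes them symbolically.
REFLINK_RE = re.compile(r"\b(?:FICLONE|FICLONERANGE|FIDEDUPERANGE)\b")

# Syscalls that create a new process or thread.
PROCESS_SYSCALLS = {"clone", "clone3", "fork", "vfork"}


def classify(line):
    """Return 'process_creation', 'reflink', 'other', or None if unparsable."""
    match = SYSCALL_RE.match(line)
    if match is None:
        return None
    name = match.group(1)
    if name in PROCESS_SYSCALLS:
        return "process_creation"
    if name == "ioctl":
        # The ioctl request is the second argument: ioctl(fd, REQUEST, ...).
        args = line[match.end():].split(",", 2)
        if len(args) >= 2 and REFLINK_RE.search(args[1]):
            return "reflink"
    return "other"


def summarize(lines):
    """Count how many parsable lines fall into each category."""
    return Counter(c for c in map(classify, lines) if c is not None)
```

If summarize() over a full trace reports no "reflink" entries, shared
extents created by clone or dedupe ioctls can be excluded as the cause.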
Also, seeing that mmap is in the log, I just remembered that 5.13-rc1
includes a fix for races between mmap writes and fsync that could fix
this problem:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=885f46d87f29a94eafe3cc707d5c4dea2be248f3
The changelog describes logging file extent items with overlapping
ranges as the result of the race, but thinking out loud now, I think
the same race could also result in logging checksum items with
overlapping ranges.
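As an illustration of what "overlapping ranges" means for checksum
items, the following is a simplified model, not the kernel's code. It
assumes 4 KiB sectors and 4-byte crc32c checksums, so an item of
item_size bytes stored at key offset `offset` covers item_size / 4
sectors starting at that byte offset. The names csum_item_range and
find_overlaps are hypothetical.

```python
SECTOR_SIZE = 4096  # assumption: 4 KiB sectors
CSUM_SIZE = 4       # assumption: crc32c, 4 bytes per checksum


def csum_item_range(offset, item_size):
    """Return the [start, end) byte range covered by one checksum item."""
    sectors = item_size // CSUM_SIZE
    return offset, offset + sectors * SECTOR_SIZE


def find_overlaps(items):
    """Given (offset, item_size) pairs, return neighbouring ranges that overlap.

    Items are sorted by start offset, as they would be ordered in a leaf;
    an overlap exists when one item's end goes beyond the next item's start.
    """
    ranges = sorted(csum_item_range(off, size) for off, size in items)
    overlaps = []
    for prev, cur in zip(ranges, ranges[1:]):
        if prev[1] > cur[0]:
            overlaps.append((prev, cur))
    return overlaps
```

In this model, two items at offsets 0 and 8192 are fine when the first
holds two checksums (it ends exactly at 8192), but overlap when it holds
three.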
If you want to test it, either try 5.13-rc1 or pick that patch and all
its dependencies into 5.12.x (the dependencies are listed at the end
of the changelog, all of them landed in 5.13-rc1).
I'll see if later today or tomorrow I can get the reproducer to run here.
[1] https://man7.org/linux/man-pages/man2/clone.2.html
[2] https://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html
[3] https://man7.org/linux/man-pages/man2/ioctl_fideduperange.2.html
> to be spent contending locks, so I assume there are multiple threads at
> work.
> I've attached the full sysdig log (~1MB), not sure if this helps.
>
> Is there a way to increase the btrfs log output, so I could try to
> observe which leaves are written? CONFIG_BTRFS_DEBUG looks like something
> I could enable. Is this the right approach to narrow down a race?
There's no way to log which leaves are written and dump their contents
on the fly.
That would produce a tremendous amount of text to log.
Most likely this is a race condition that the sql server workload is
good for triggering (at least on your box).
Thanks
>
> Thanks.
--
Filipe David Manana,
“Whether you think you can, or you think you can't — you're right.”