linux-btrfs.vger.kernel.org archive mirror
From: Filipe Manana <fdmanana@gmail.com>
To: Philipp Fent <fent@in.tum.de>
Cc: Wang Yugui <wangyugui@e16-tech.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Leaf corruption due to csum range
Date: Tue, 11 May 2021 13:35:31 +0100
Message-ID: <CAL3q7H6WmvatgNpGA6pqPBfe6TjPViwwCJo=wrjBOZRN0q0LuQ@mail.gmail.com>
In-Reply-To: <ad414944-2418-3728-ac1a-5d4d37e37ac1@in.tum.de>

On Tue, May 11, 2021 at 12:35 PM Philipp Fent <fent@in.tum.de> wrote:
>
> On 11.05.21 10:18, Wang Yugui wrote:
> > Is this a server with ECC memory?
>
> My machine does not have ECC memory. So I guess a random bitflip is a
> possibility, but I would rule this out. Memtester didn't report any
> errors in an extended test, and I run quite a lot of write heavy
> databases like sql server. This sql server workload is the only
> application where I ever got csum errors. And I really get them *every
> time* I run it. I think it is more likely an issue with sql server's
> write pattern.
>
>
> On 11.05.21 10:56, Filipe Manana wrote:
> > Most likely it's a race when adding checksums. In this case for the
> > log tree (fsync).
> > Try to see if there are reflink operations (clone and dedupe) done by
> > sql server (or maybe docker), in case there aren't, that excludes
> > shared extents being the cause of the problem.
>
> I now also ran the sql server docker container under sysdig, which gave
> me the following breakdown of executed system calls of the container:
> CALLS/S  TOT TIME  AVG TIME SYSCALL
>   13.05   702.41s     420ms futex
>   12.11     193us     124ns gettid
>    7.55    1.05ms    1.09us mprotect
>    6.55      70ms      84us nanosleep
>    5.70     128us     175ns getcpu
>    5.44    1.04ms    1.49us read
>    5.41    1.18ms    1.71us mmap
>    5.25    1.99ms    2.96us openat
>    4.44     122us     215ns close
>    4.23     308us     568ns fstat
>    1.50     298us    1.55us access
>    1.11      21us     153ns getpid
>    1.08      38us     279ns rt_sigprocmask
>    0.94      41us     343ns lseek
>    0.91     542us    4.67us munmap
>    0.77      25us     260ns sigaltstack
>    0.77      22us     226ns set_robust_list
>    0.70     616us    6.92us clone
>    0.55     653us    9.33us sendto
>    0.53    88.96s     1.30s recvfrom
>    0.44     867us      15us madvise
>    0.44      11us     198ns rt_sigaction
>
> Filipe's suspicion of a race condition might be a good lead. There are
> several clone operations in the trace and the majority of runtime seems

I'm not familiar with sysdig myself, but those clone calls are likely
the system call for creating a process [1], and not the ioctls for
cloning [2] or dedupe [3].

strace output would be clearer to me, since I'm more familiar with it
(or even better, bpftrace).
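
A minimal sketch of how an strace log could be scanned to tell
process-creating clone(2) calls apart from reflink ioctls. It assumes the
log was captured with something like `strace -f -e trace=clone,ioctl` and
that strace prints the request names FICLONE, FICLONERANGE and
FIDEDUPERANGE (or their older btrfs aliases); the function names and the
line format handled here are illustrative assumptions, not a verified
parser for every strace version.

```python
import re
from collections import Counter

# Second ioctl argument as strace is assumed to print it, e.g.
# "ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = 0".
IOCTL_REQUEST = re.compile(r"\bioctl\([^,]+,\s*([A-Za-z0-9_ ]+)")
# clone(2) / clone3(2): process or thread creation, unrelated to reflinks.
PROCESS_CLONE = re.compile(r"\bclone3?\(")

# Generic names plus the older btrfs-specific aliases for the same requests.
CLONE_IOCTLS = {"FICLONE", "FICLONERANGE",
                "BTRFS_IOC_CLONE", "BTRFS_IOC_CLONE_RANGE"}
DEDUPE_IOCTLS = {"FIDEDUPERANGE", "BTRFS_IOC_FILE_EXTENT_SAME"}


def classify(line):
    """Return 'clone ioctl', 'dedupe ioctl', 'process clone' or None."""
    match = IOCTL_REQUEST.search(line)
    if match:
        names = set(match.group(1).split())
        if names & DEDUPE_IOCTLS:
            return "dedupe ioctl"
        if names & CLONE_IOCTLS:
            return "clone ioctl"
        return None  # some other ioctl, not interesting here
    if PROCESS_CLONE.search(line):
        return "process clone"
    return None


def summarize(lines):
    """Count the interesting call kinds in an iterable of strace lines."""
    return Counter(kind for kind in map(classify, lines) if kind)
```

Feeding the lines of a captured log to `summarize()` gives a count per
kind; a log with only "process clone" entries would suggest that no
reflink operations are involved.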

Also, seeing that mmap is in the log, I just remembered that 5.13-rc1
includes a fix for races between mmap writes and fsync that could fix
this:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=885f46d87f29a94eafe3cc707d5c4dea2be248f3

The changelog identifies logging file extent items with overlapping
ranges as the result of the race, but thinking out loud now, I think
the same race could also result in logging checksum items with
overlapping ranges.
If you want to test it, either try 5.13-rc1 or pick that patch and all
its dependencies into 5.12.x (the dependencies are listed at the end
of the changelog, all of them landed in 5.13-rc1).
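
To make "checksum items with overlapping ranges" concrete, here is a
minimal sketch of the range arithmetic. It assumes each checksum item is
described by its key offset (the starting byte number) and its item size,
with one checksum per sector; the defaults of 4-byte crc32c checksums and
4096-byte sectors, and the helper names, are assumptions for illustration
and this is not the kernel's actual check.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int]  # (start, end), end exclusive, in bytes


def csum_item_range(start: int, item_size: int,
                    csum_size: int = 4, sectorsize: int = 4096) -> Range:
    """Byte range covered by one checksum item.

    item_size // csum_size is the number of checksums in the item, and each
    checksum covers one sector.
    """
    sectors = item_size // csum_size
    return start, start + sectors * sectorsize


def find_overlaps(items: Iterable[Tuple[int, int]],
                  csum_size: int = 4,
                  sectorsize: int = 4096) -> List[Tuple[Range, Range]]:
    """Return neighbouring ranges (sorted by start) that overlap.

    items is an iterable of (key offset, item size) pairs. Comparing
    neighbours is enough to detect whether any overlap exists once the
    ranges are sorted by their start.
    """
    ranges = sorted(csum_item_range(s, n, csum_size, sectorsize)
                    for s, n in items)
    return [(prev, cur) for prev, cur in zip(ranges, ranges[1:])
            if cur[0] < prev[1]]
```

For example, an item at offset 0 holding three checksums covers bytes 0 to
12288, so a second item starting at offset 8192 would overlap it, whereas
an item at offset 0 with only two checksums would not.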

I'll see if later today or tomorrow I can get the reproducer to run here.

[1] https://man7.org/linux/man-pages/man2/clone.2.html
[2] https://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html
[3] https://man7.org/linux/man-pages/man2/ioctl_fideduperange.2.html

> to be spent contending locks, so I assume there are multiple threads at
> work.
> I've attached the full sysdig log (~1MB), not sure if this helps.
>
> Is there a way to increase the btrfs log output, so I could try to
> observe which leaves are written? CONFIG_BTRFS_DEBUG looks like something
> I could enable. Is this the right approach to narrow down a race?

There's no way to log which leaves are written and dump their contents
on the fly.
That would produce a tremendous amount of text to log.

Most likely this is a race condition that the sql server workload is
good at triggering (at least on your box).

Thanks

>
> Thanks.



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

