All of lore.kernel.org
 help / color / mirror / Atom feed
From: Konstantin <newsbox1026@web.de>
To: Anand Jain <Anand.Jain@oracle.com>,
	kreijack@inwind.it, MegaBrutal <megabrutal@gmail.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Cc: Robert White <rwhite@pobox.com>, Chris Mason <clm@fb.com>
Subject: Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots
Date: Mon, 08 Dec 2014 01:05:16 +0100	[thread overview]
Message-ID: <5484EB3C.4000608@web.de> (raw)
In-Reply-To: <547DA86C.4050100@oracle.com>


Anand Jain wrote on 02.12.2014 at 12:54:
>
>
>
> On 02/12/2014 19:14, Goffredo Baroncelli wrote:
>> I further investigate this issue.
>>
>> MegaBrutal, reported the following issue: doing a lvm snapshot of the
>> device of a
>> mounted btrfs fs, the new snapshot device name replaces the name of
>> the original
>> device in the output of /proc/mounts. This confused tools like
>> grub-probe which
>> report a wrong root device.
>
> very good test case indeed thanks.
>
> Actual IO would still go to the original device, until FS is remounted.
This seems to be correct at least at the beginning but I wouldn't be so
sure - why else the system is crashing in my case after a while when the
second drive is present?! So if the kernel was not using it in some way,
except the wrong /proc/mounts nothing else should happen.

>
>> It has to be pointed out that instead the link under
>> /sys/fs/btrfs/<fsid>/devices is
>> correct.
>
> In this context the above sysfs path will be out of sync with the
> reality, its just stale sysfs entry.
>
>>
>> What happens is that *even if the filesystem is mounted*, doing a
>> "btrfs dev scan" of a snapshot (of the real volume), the device name
>> of the
>> filesystem is replaced with the snapshot one.
>
> we have some fundamentally wrong stuff. My original patch tried
> to fix it. But later discovered that some external entities like
> systmed and boot process is using that bug as a feature and we had
> to revert the patch.
>
> Fundamentally scsi inquiry serial number is only number which is unique
> to the device (including the virtual device, but there could be some
> legacy virtual device which didn't follow that strictly, Anyway those
> I deem to be device side issue.) Btrfs depends on the combination of
> fsid, uuid and devid (and generation number) to identify the unique
> device volume, which is weak and easy to go wrong.
>
>
>> Anand, with b96de000b, tried to fix it; however further regression
>> appeared
>> and Chris reverted this commit (see below).
>>
>> BR
>> G.Baroncelli
>>
>> commit b96de000bc8bc9688b3a2abea4332bd57648a49f
>> Author: Anand Jain <anand.jain@oracle.com>
>> Date:   Thu Jul 3 18:22:05 2014 +0800
>>
>>      Btrfs: device_list_add() should not update list when mounted
>> [...]
>>
>>
>> commit 0f23ae74f589304bf33233f85737f4fd368549eb
>> Author: Chris Mason <clm@fb.com>
>> Date:   Thu Sep 18 07:49:05 2014 -0700
>>
>>      Revert "Btrfs: device_list_add() should not update list when
>> mounted"
>>
>>      This reverts commit b96de000bc8bc9688b3a2abea4332bd57648a49f.
>>
>>      This commit is triggering failures to mount by subvolume id in some
>>      configurations.  The main problem is how many different ways this
>>      scanning function is used, both for scanning while mounted and
>>      unmounted.  A proper cleanup is too big for late rcs.
>>
>> [...]
>>
>> On 12/02/2014 09:28 AM, MegaBrutal wrote:
>>> 2014-12-02 8:50 GMT+01:00 Goffredo Baroncelli <kreijack@inwind.it>:
>>>> On 12/02/2014 01:15 AM, MegaBrutal wrote:
>>>>> 2014-12-02 0:24 GMT+01:00 Robert White <rwhite@pobox.com>:
>>>>>> On 12/01/2014 02:10 PM, MegaBrutal wrote:
>>>>>>>
>>>>>>> Since having duplicate UUIDs on devices is not a problem for me
>>>>>>> since
>>>>>>> I can tell them apart by LVM names, the discussion is of little
>>>>>>> relevance to my use case. Of course it's interesting and I like to
>>>>>>> read it along, it is not about the actual problem at hand.
>>>>>>>
>>>>>>
>>>>>> Which is why you use the device= mount option, which would take
>>>>>> LVM names
>>>>>> and which was repeatedly discussed as solving this very problem.
>>>>>>
>>>>>> Once you decide to duplicate the UUIDs with LVM snapshots you
>>>>>> take up the
>>>>>> burden of disambiguating your storage.
>>>>>>
>>>>>> Which is part of why re-reading was suggested as this was covered
>>>>>> in some
>>>>>> depth and _is_ _exactly_ about the problem at hand.
>>>>>
>>>>> Nope.
>>>>>
>>>>> root@reproduce-1391429:~# cat /proc/cmdline
>>>>> BOOT_IMAGE=/vmlinuz-3.18.0-031800rc5-generic
>>>>> root=/dev/mapper/vg-rootlv ro
>>>>> rootflags=device=/dev/mapper/vg-rootlv,subvol=@
>>>>>
>>>>> Observe, device= mount option is added.
>>>>
>>>> device= options is needed only in a btrfs multi-volume scenario.
>>>> If you have only one disk, this is not needed
>>>>
>>>
>>> I know. I only did this as a demonstration for Robert. He insisted it
>>> will certainly solve the problem. Well, it doesn't.
>>>
>>>
>>>>>
>>>>> root@reproduce-1391429:~# ./reproduce-1391429.sh
>>>>> #!/bin/sh -v
>>>>> lvs
>>>>>    LV     VG   Attr      LSize   Pool Origin Data%  Move Log
>>>>> Copy%  Convert
>>>>>    rootlv vg   -wi-ao---   1.00g
>>>>>    swap0  vg   -wi-ao--- 256.00m
>>>>>
>>>>> grub-probe --target=device /
>>>>> /dev/mapper/vg-rootlv
>>>>>
>>>>> grep " / " /proc/mounts
>>>>> rootfs / rootfs rw 0 0
>>>>> /dev/dm-1 / btrfs rw,relatime,space_cache 0 0
>>>>>
>>>>> lvcreate --snapshot --size=128M --name z vg/rootlv
>>>>>    Logical volume "z" created
>>>>>
>>>>> lvs
>>>>>    LV     VG   Attr      LSize   Pool Origin Data%  Move Log
>>>>> Copy%  Convert
>>>>>    rootlv vg   owi-aos--   1.00g
>>>>>    swap0  vg   -wi-ao--- 256.00m
>>>>>    z      vg   swi-a-s-- 128.00m      rootlv   0.11
>>>>>
>>>>> ls -l /dev/vg/
>>>>> total 0
>>>>> lrwxrwxrwx 1 root root 7 Dec  2 00:12 rootlv -> ../dm-1
>>>>> lrwxrwxrwx 1 root root 7 Dec  2 00:12 swap0 -> ../dm-0
>>>>> lrwxrwxrwx 1 root root 7 Dec  2 00:12 z -> ../dm-2
>>>>>
>>>>> grub-probe --target=device /
>>>>> /dev/mapper/vg-z
>>>>>
>>>>> grep " / " /proc/mounts
>>>>> rootfs / rootfs rw 0 0
>>>>> /dev/dm-2 / btrfs rw,relatime,space_cache 0 0
>>>>
>>>> What /proc/self/mountinfo contains ?
>>>
>>> Before creating snapshot:
>>>
>>> 15 20 0:15 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
>>> 16 20 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
>>> 17 20 0:5 / /dev rw,relatime - devtmpfs udev
>>> rw,size=241692k,nr_inodes=60423,mode=755
>>> 18 17 0:12 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts
>>> rw,gid=5,mode=620,ptmxmode=000
>>> 19 20 0:16 / /run rw,nosuid,noexec,relatime - tmpfs tmpfs
>>> rw,size=50084k,mode=755
>>> 20 0 0:17 /@ / rw,relatime - btrfs /dev/dm-1 rw,space_cache
>>> <----- THIS!
>>> 21 15 0:20 / /sys/fs/cgroup rw,relatime - tmpfs none
>>> rw,size=4k,mode=755
>>> 22 15 0:21 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
>>> 23 15 0:6 / /sys/kernel/debug rw,relatime - debugfs none rw
>>> 24 15 0:10 / /sys/kernel/security rw,relatime - securityfs none rw
>>> 25 19 0:22 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs none
>>> rw,size=5120k
>>> 26 19 0:23 / /run/shm rw,nosuid,nodev,relatime - tmpfs none rw
>>> 27 19 0:24 / /run/user rw,nosuid,nodev,noexec,relatime - tmpfs none
>>> rw,size=102400k,mode=755
>>> 28 15 0:25 / /sys/fs/pstore rw,relatime - pstore none rw
>>> 29 20 253:1 / /boot rw,relatime - ext2 /dev/vda1 rw
>>>
>>>
>>> After creating snapshot:
>>>
>>> 15 20 0:15 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
>>> 16 20 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
>>> 17 20 0:5 / /dev rw,relatime - devtmpfs udev
>>> rw,size=241692k,nr_inodes=60423,mode=755
>>> 18 17 0:12 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts
>>> rw,gid=5,mode=620,ptmxmode=000
>>> 19 20 0:16 / /run rw,nosuid,noexec,relatime - tmpfs tmpfs
>>> rw,size=50084k,mode=755
>>> 20 0 0:17 /@ / rw,relatime - btrfs /dev/dm-2 rw,space_cache
>>> <----- WTF?!
>>> 21 15 0:20 / /sys/fs/cgroup rw,relatime - tmpfs none
>>> rw,size=4k,mode=755
>>> 22 15 0:21 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
>>> 23 15 0:6 / /sys/kernel/debug rw,relatime - debugfs none rw
>>> 24 15 0:10 / /sys/kernel/security rw,relatime - securityfs none rw
>>> 25 19 0:22 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs none
>>> rw,size=5120k
>>> 26 19 0:23 / /run/shm rw,nosuid,nodev,relatime - tmpfs none rw
>>> 27 19 0:24 / /run/user rw,nosuid,nodev,noexec,relatime - tmpfs none
>>> rw,size=102400k,mode=755
>>> 28 15 0:25 / /sys/fs/pstore rw,relatime - pstore none rw
>>> 29 20 253:1 / /boot rw,relatime - ext2 /dev/vda1 rw
>>>
>>>
>>> So it's consistent with what /proc/mounts reports.
>>>
>>>
>>>>
>>>> And more important question: it is only the value
>>>> returned by /proc/mount wrongly or also the filesystem
>>>> content is affected ?
>>>>
>>>
>>> I quote my bug report on this:
>>>
>>> "The information reported in /proc/mounts is certainly bogus, since
>>> still the origin device is being written, the kernel does not actually
>>> mix up the devices for write operations, and such, the phenomenon does
>>> not cause data corruption. (I did an entire distro release upgrade
>>> while the conditions were present, and I centainly would have suffered
>>> severe data corruption otherwise. Fortunately, the origin device had
>>> the new distro, and the snapshot device had the old one, so besides
>>> the mixup in /proc/mounts, no actual damage happened.)"
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe
>>> linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



  parent reply	other threads:[~2014-12-08  0:05 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-01 12:56 PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots MegaBrutal
2014-12-01 17:27 ` Robert White
2014-12-01 22:10   ` MegaBrutal
2014-12-01 23:24     ` Robert White
2014-12-02  0:15       ` MegaBrutal
2014-12-02  7:50         ` Goffredo Baroncelli
2014-12-02  8:28           ` MegaBrutal
2014-12-02 11:14             ` Goffredo Baroncelli
2014-12-02 11:54               ` Anand Jain
2014-12-02 12:23                 ` Austin S Hemmelgarn
2014-12-02 19:11                   ` Phillip Susi
2014-12-03  8:24                     ` Goffredo Baroncelli
2014-12-04  3:09                       ` Phillip Susi
2014-12-04  5:15                         ` Duncan
2014-12-04  8:20                           ` MegaBrutal
2014-12-04 13:14                             ` Duncan
2014-12-02 19:14                 ` Phillip Susi
2014-12-08  0:05                 ` Konstantin [this message]
2014-12-01 21:45 ` Konstantin
2014-12-02  5:47   ` MegaBrutal
2014-12-02 19:19   ` Phillip Susi
2014-12-03  3:01     ` Russell Coker
2014-12-08  0:32     ` Konstantin
2014-12-08 14:59       ` Phillip Susi
2014-12-08 22:25         ` Konstantin
2014-12-09 16:04           ` Phillip Susi
2014-12-10  3:10         ` Anand Jain
2014-12-10 15:57           ` Phillip Susi
2014-12-08 17:20       ` Robert White
2014-12-08 22:38         ` Konstantin
2014-12-08 23:17           ` Robert White

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5484EB3C.4000608@web.de \
    --to=newsbox1026@web.de \
    --cc=Anand.Jain@oracle.com \
    --cc=clm@fb.com \
    --cc=kreijack@inwind.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=megabrutal@gmail.com \
    --cc=rwhite@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.