* mismatch_cnt constantly goes up on ssd+hdd raid1
From: tlknv @ 2015-06-14 17:13 UTC (permalink / raw)
To: linux-raid
Hello,
I have a RAID1 array that mirrors a root/boot partition across one SSD and two HDDs (both HDDs are write-mostly). mismatch_cnt keeps going up even though there are very few writes to the partition, since /var is mounted separately. After I update several packages I typically see mismatch_cnt somewhere between 500,000 and 2,000,000. I have read a number of threads on this list but could not find an explanation of what could cause mismatch_cnt to grow that much.
I checked md5 sums using /var/lib/dpkg/info/*.md5sums and saw only a few errors, mostly in text files which look OK to me. I guess that when I check, all reads go to the SSD (as both HDDs in this array are write-mostly), so md5sum only shows that there is no problem on the SSD.
Note that this partition is used as both boot and root. Just in case, here is some more info about my system:
root@tbeh:~# uname -a
Linux tbeh 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1 (2015-05-24) x86_64 GNU/Linux
root@tbeh:~# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sun Jun 7 18:38:51 2015
Raid Level : raid1
Array Size : 13442048 (12.82 GiB 13.76 GB)
Used Dev Size : 13442048 (12.82 GiB 13.76 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Sun Jun 14 08:12:28 2015
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Name : tbeh:0 (local to host tbeh)
UUID : c50d3fbf:5da849fc:9a6872ae:6905e381
Events : 213
    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdc2
       2       8       18        1      active sync writemostly   /dev/sdb2
       1       8        2        2      active sync writemostly   /dev/sda2
root@tbeh:~# fdisk -l /dev/sdc
Disk /dev/sdc: 111.8 GiB, 120034123776 bytes, 234441648 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x858bea60
Device     Boot    Start      End  Sectors  Size Id Type
/dev/sdc1  *        2048 50333695 50331648   24G 83 Linux
/dev/sdc2       50333696 77234175 26900480 12.8G da Non-FS data
root@tbeh:~# fdisk -l /dev/sda
Disk /dev/sda: 596.2 GiB, 640135028736 bytes, 1250263728 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x0f06bf61
Device     Boot     Start        End    Sectors   Size Id Type
/dev/sda1              63     498014     497952 243.1M da Non-FS data
/dev/sda2          498015   27856709   27358695    13G da Non-FS data
/dev/sda3        27856710   35889209    8032500   3.9G da Non-FS data
/dev/sda4        35889210 1250258624 1214369415 579.1G  5 Extended
/dev/sda5        35889273   82782944   46893672  22.4G da Non-FS data
/dev/sda6        82783008  976768064  893985057 426.3G da Non-FS data
/dev/sda7       976768128 1250258624  273490497 130.4G 83 Linux
root@tbeh:~# fdisk -l /dev/sdb
Disk /dev/sdb: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x99a9f6d9
Device     Boot    Start       End   Sectors   Size Id Type
/dev/sdb1             63    498014    497952 243.1M da Non-FS data
/dev/sdb2         498015  27856709  27358695    13G da Non-FS data
/dev/sdb3       27856710  35889209   8032500   3.9G da Non-FS data
/dev/sdb4       35889210 976768064 940878855 448.7G  5 Extended
/dev/sdb5       35889273  82782944  46893672  22.4G da Non-FS data
/dev/sdb6       82783008 976768064 893985057 426.3G da Non-FS data
Just to minimize the damage, I now mount / read-only and remount it rw as necessary. Unfortunately I don't have anything to update right now, but AFAIR right after a package update (and running echo check > /sys/block/md0/md/sync_action) mismatch_cnt wasn't too high; it went up after I rebooted the system (and ran echo check > /sys/block/md0/md/sync_action again).
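For readers who want to repeat the check-then-read cycle described above, here is a minimal sketch. It assumes the array is md0 and that the standard md sysfs files sync_action and mismatch_cnt are present; the MD_SYSFS variable and the function names are hypothetical helpers introduced only for illustration, not part of mdadm.

```shell
# Sketch: trigger an md consistency check and report mismatch_cnt afterwards.
# MD_SYSFS is a hypothetical override so the helpers can point at another array.
MD_SYSFS="${MD_SYSFS:-/sys/block/md0/md}"

# Start a consistency check (needs root on a real array).
md_start_check() {
    echo check > "$MD_SYSFS/sync_action"
}

# Print the current sync action ("idle" when no check or resync is running).
md_sync_action() {
    cat "$MD_SYSFS/sync_action"
}

# Print the mismatch counter left behind by the most recent check.
md_mismatch_cnt() {
    cat "$MD_SYSFS/mismatch_cnt"
}

# Poll until the array is idle again, then print mismatch_cnt.
md_wait_and_report() {
    while [ "$(md_sync_action)" != "idle" ]; do
        sleep 5
    done
    md_mismatch_cnt
}
```

As root, something like `md_start_check && md_wait_and_report` would run one check and print the resulting counter once the array is idle.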
The following may have nothing to do with mismatch_cnt, as it is observed even when mismatch_cnt is 0 (after checking the read-only partition), but I want to understand how it's possible.
A cmp of the SSD and HDD partitions shows lots of differences:
root@tbeh:~# sync; cmp -l /dev/sdc2 /dev/sda2|wc -l
cmp: EOF on /dev/sdc2
1903215
BTW, only the first few hundred differing bytes (at most) have a non-zero value on the SSD; in the rest of the differences the SSD byte is 0.
4233 0 347
4234 70 65
4235 232 241
4257 0 1
4265 51 264
4266 271 260
4267 14 301
4268 116 317
4269 353 326
4270 21 221
4271 360 176
4272 133 265
4273 154 262
4274 56 120
4275 116 370
4276 304 72
4277 233 62
4278 241 4
4279 161 243
4280 363 353
4281 0 1
4313 31 125
4314 201 173
4315 34 102
4316 15 127
4609 0 376
4610 0 377
4611 0 376
4612 0 377
4613 0 376
4614 0 377
4615 0 376
4616 0 377
4617 0 376
4618 0 377
4619 0 376
4620 0 377
4621 0 376
4622 0 377
4623 0 376
4624 0 377
4625 0 376
4626 0 377
4627 0 376
4628 0 377
4629 0 376
...
I don't see any differences between the two HDD partitions though.
Does anyone have any idea what could be wrong with my system, or what I could try in order to localize the problem?
Thanks,
Boris
* Re: mismatch_cnt constantly goes up on ssd+hdd raid1
From: NeilBrown @ 2015-06-25 1:33 UTC (permalink / raw)
To: tlknv; +Cc: linux-raid
On Sun, 14 Jun 2015 20:13:16 +0300 tlknv <tlknv@yandex.ru> wrote:
> Hello,
> I have raid 1 which mirrors a root/boot partition on 1SSD and 2HDD
> (write-mostly). mismatch_cnt goes up even when there are very few
> writes to the partition as /var is mounted separately. After I update
> several packages I typically see mismatch_cnt somewhere between
> 500,000 and 2,000,000. I have read a number of threads in this DL
> but could not find an explanation of what could cause mismatch_cnt
> to grow that much. I checked md5 sums using
> /var/lib/dpkg/info/*.md5sums, and didn't see many errors, even
> though there are few, mostly in text files which look ok to me. I
> guess when I check, all reads go to SSD (as both HDDs in this raid
> are write-mostly), and thus md5sum only shows no problem on
> SSD. Note, this partition is used as both boot and root and just in
> case here is some more info about my system:
This does surprise me.
I had another look at the code and there could be a bug that would let
'check' see the difference between when the first write completes and
when the write-behind writes complete, but you would need to run the
check while the install was happening for that to be noticed, and even
then you would need to be unlucky.
What you could try is:
 - add a bitmap (mdadm --grow /dev/md0 --bitmap=internal) so that
   recovery will be fast if you remove then re-add a device.
 - fail and remove one of the HDDs
      mdadm /dev/md0 --fail /dev/sda2
      mdadm /dev/md0 --remove /dev/sda2
 - Find the data offset and use losetup to access the data directly.
      mdadm --examine /dev/sda2 | grep 'Data Offset'
          Data Offset : 160 sectors.
   Convert that to 'K' (160 sectors of 512 bytes = 80K) and
      losetup --read-only --offset=80K /dev/loop0 /dev/sda2
 - perform some *read-only* examination of loop0.
      fsck -n /dev/loop0
      mount -o ro /dev/loop0 /mnt
   and see if there are any differences in files that have changed
   recently.
 - when finished, "umount /mnt", "losetup -d /dev/loop0" and
      mdadm /dev/md0 --re-add /dev/sda2
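The sector-to-K conversion in the losetup step can be sketched as below. This assumes the Data Offset is reported in 512-byte sectors, as in the example output above; the function names are hypothetical helpers, not mdadm features.

```shell
# Convert a count of 512-byte sectors (as printed by mdadm --examine)
# to KiB, for use with losetup --offset.
sectors_to_kib() {
    # Two 512-byte sectors make one KiB; an odd count would be truncated,
    # so in that case pass the offset to losetup in bytes instead.
    echo $(( $1 * 512 / 1024 ))
}

# Extract the numeric Data Offset from `mdadm --examine` output on stdin.
data_offset_sectors() {
    awk -F: '/Data Offset/ { split($2, f, " "); print f[1] }'
}
```

For example, `mdadm --examine /dev/sda2 | data_offset_sectors` should print 160 for the output shown above, and `sectors_to_kib 160` prints 80, matching the --offset=80K used in the list.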
> root@tbeh:~# sync; cmp -l /dev/sdc2 /dev/sda2|wc -l
> cmp: EOF on /dev/sdc2
> 1903215
>
> BTW, only first few hundred bytes (at most) have non-zero value on SSD, the rest of differences has 0 bytes on SSD.
> 4233 0 347
> 4234 70 65
> 4235 232 241
> 4257 0 1
Any bytes before the "Data Offset" identified above could easily be
different, or after "Data Offset" + "Used Dev Size".
What bytes are different within that range?
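One way to answer that question is to filter the cmp -l output down to offsets inside the data area. The sketch below assumes that both members use the same Data Offset (otherwise a raw cmp of the two partitions does not line up at all), that Data Offset is given in 512-byte sectors, and that Used Dev Size is given in KiB as mdadm -D prints it. The function name is hypothetical.

```shell
# Keep only the `cmp -l` lines whose byte offset falls inside the md data area.
# $1 = Data Offset in 512-byte sectors, $2 = Used Dev Size in KiB;
# the cmp -l output is read from stdin.
diffs_in_data_area() {
    awk -v off="$1" -v size="$2" '
        BEGIN {
            # cmp -l numbers bytes starting from 1.
            first = off * 512 + 1
            last  = off * 512 + size * 1024
        }
        $1 >= first && $1 <= last
    '
}
```

With the values from this thread it would be used as `cmp -l /dev/sdc2 /dev/sda2 | diffs_in_data_area 160 13442048 | wc -l`. With a 160-sector offset the data area starts at byte 81921, so the differences listed in the original report (bytes 4233 onwards) would all be filtered out as lying before it.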
NeilBrown
* Re: mismatch_cnt constantly goes up on ssd+hdd raid1
From: Roman Mamedov @ 2015-06-25 5:19 UTC (permalink / raw)
To: NeilBrown; +Cc: tlknv, linux-raid
On Thu, 25 Jun 2015 11:33:35 +1000
NeilBrown <neilb@suse.com> wrote:
> This does surprise me.
>
> I had another look at the code and there could be a bug that would let
> 'check' see the difference between when the first write completes and
> when the write-behind writes complete, but you would need to run the
> check while the install was happening for that to be noticed, and even
> then you would need to be unlucky.
Couldn't this be simply the normal observed effect of using TRIM on SSD?
After deleting some files, the filesystem issues a discard request, it
does nothing to the HDDs, but the content of the discarded areas on SSD is no
longer deterministic (or mostly zeroed, as mentioned in the original report).
So there is now a mismatch between the content of HDDs and SSD, but since it
is in the area of deleted files, it doesn't affect the system in any way.
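To see whether a filesystem is actually mounted with online discard, the mount table can be inspected. This is only a sketch: the function name is hypothetical, it only detects the 'discard' mount option, and it would not notice discards issued by other means such as a periodic fstrim run.

```shell
# Print the mount points whose mount options include "discard".
# $1 = mount table to read (defaults to /proc/mounts).
mounts_with_discard() {
    awk '{
        # Field 4 of /proc/mounts is the comma-separated option list.
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "discard") { print $2; break }
    }' "${1:-/proc/mounts}"
}
```

Running `mounts_with_discard` with no argument lists every currently mounted filesystem that has the option set.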
--
With respect,
Roman
* Re: mismatch_cnt constantly goes up on ssd+hdd raid1
From: Brad Campbell @ 2015-06-25 5:30 UTC (permalink / raw)
To: Roman Mamedov, NeilBrown; +Cc: tlknv, linux-raid
On 25/06/15 13:19, Roman Mamedov wrote:
> Couldn't this be simply the normal observed effect of using TRIM on SSD?
> After deleting some files, the filesystem issues a discard request, it
> does nothing to the HDDs, but the content of the discarded areas on SSD is no
> longer deterministic (or mostly zeroed, as mentioned in the original report).
> So there is now a mismatch between the content of HDDs and SSD, but since it
> is in the area of deleted files, it doesn't affect the system in any way.
>
I get this on a RAID10 with Intel and Samsung SSDs. One supports
deterministic reads after TRIM and the other doesn't. The mismatch count is
always through the roof as a result, but there are no other negative effects.
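Whether a given SSD advertises deterministic reads after TRIM can usually be read from its identify data. The sketch below classifies `hdparm -I` output; it assumes hdparm prints the feature lines "Deterministic read ZEROs after TRIM" or "Deterministic read data after TRIM" when the drive reports them, and the function name is hypothetical.

```shell
# Classify the post-TRIM read behaviour from `hdparm -I` output on stdin.
# Prints "zeroes", "deterministic", or "unspecified".
trim_read_behaviour() {
    awk '
        /Deterministic read ZEROs after TRIM/ { r = "zeroes" }
        /Deterministic read data after TRIM/  { if (r == "") r = "deterministic" }
        END { print (r == "" ? "unspecified" : r) }
    '
}
```

As root, `hdparm -I /dev/sdc | trim_read_behaviour` would report what the drive claims; "unspecified" means neither line was found, so trimmed blocks may read back differently from the other mirror members.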
* Re: mismatch_cnt constantly goes up on ssd+hdd raid1
From: NeilBrown @ 2015-06-25 7:25 UTC (permalink / raw)
To: Roman Mamedov; +Cc: tlknv, linux-raid
On Thu, 25 Jun 2015 10:19:59 +0500 Roman Mamedov <rm@romanrm.net> wrote:
> On Thu, 25 Jun 2015 11:33:35 +1000
> NeilBrown <neilb@suse.com> wrote:
>
> > This does surprise me.
> >
> > I had another look at the code and there could be a bug that would let
> > 'check' see the difference between when the first write completes and
> > when the write-behind writes complete, but you would need to run the
> > check while the install was happening for that to be noticed, and even
> > then you would need to be unlucky.
>
> Couldn't this be simply the normal observed effect of using TRIM on SSD?
Yes, of course it could. I try not to think about TRIM too much - makes me ill :-)
Thanks,
NeilBrown
> After deleting some files, the filesystem issues a discard request, it
> does nothing to the HDDs, but the content of the discarded areas on SSD is no
> longer deterministic (or mostly zeroed, as mentioned in the original report).
> So there is now a mismatch between the content of HDDs and SSD, but since it
> is in the area of deleted files, it doesn't affect the system in any way.
>
* Re: mismatch_cnt constantly goes up on ssd+hdd raid1
From: tlknv @ 2015-06-25 15:33 UTC (permalink / raw)
To: NeilBrown, Roman Mamedov; +Cc: linux-raid
Neil,
Thanks a lot for all the info and steps to identify the problem.
I have just discovered that I had the 'discard' mount option set even though I thought it wasn't there :-(
After removing 'discard' and forcing a 'repair', mismatch_cnt stays at 0 even after a bunch of writes and (most importantly) deletes on the partition. BTW, what are the units of mismatch_cnt? Is it 512-byte sectors or something else?
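The units question is not answered in this thread. As far as I understand the kernel md documentation, mismatch_cnt counts 512-byte sectors, and because the comparison works on whole pages the count can overstate the real damage. Under that assumption, a rough conversion to MiB looks like the sketch below (the function name is hypothetical).

```shell
# Convert a mismatch_cnt value to whole MiB, ASSUMING it counts 512-byte sectors.
mismatch_to_mib() {
    # 1 MiB = 1048576 bytes; the result is truncated to an integer.
    echo $(( $1 * 512 / 1048576 ))
}
```

Under that assumption the 500,000 to 2,000,000 range reported earlier corresponds to roughly 244 to 976 MiB of the 12.82 GiB array.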
AFAIU md could potentially collect info on trimmed sectors/blocks and exclude them from mismatch checking. Couldn't it?
I'll look at the range of the sectors which are different even when mismatch_cnt is 0.
Thanks again,
Boris
* Re: mismatch_cnt constantly goes up on ssd+hdd raid1
From: Roman Mamedov @ 2015-06-25 15:50 UTC (permalink / raw)
To: tlknv; +Cc: NeilBrown, linux-raid
On Thu, 25 Jun 2015 18:33:16 +0300
tlknv <tlknv@yandex.ru> wrote:
> I have just discovered that I had 'discard' mount option even though I thought it wasn't there :-(
> After removing 'discard' and forcing 'repair' mismatch_cnt stays 0 even after a bunch of writes and deletes (most importantly) to the partition.
I wouldn't recommend disabling discard 'for good' though; from your experiment
we can probably conclude the mismatch_cnt numbers you had previously are indeed
harmless. If you had no other issues due to discard being active, then what you
did now is disable a working and useful feature to fix what's only a cosmetic
problem.
--
With respect,
Roman
* Re: mismatch_cnt constantly goes up on ssd+hdd raid1
From: Boris T @ 2015-06-25 17:24 UTC (permalink / raw)
To: Roman Mamedov; +Cc: tlknv, NeilBrown, linux-raid
Roman,
Thanks for the suggestion. I would probably consider keeping TRIM if I
had less free/unpartitioned space on my SSD. Currently all partitioned
space is just a fraction (about 1/3) of the total SSD size. In this
situation I would rather make sure that my raid is consistent than
prolong the life of the SSD a bit.
Thanks,
Boris