Subject: Re: [RFC] How to handle an ugly md raid0 sector map bug ?
To: Song Liu
Cc: NeilBrown, linux-raid, "linux-block@vger.kernel.org"
From: Coly Li
Organization: SUSE Labs
Date: Sat, 24 Aug 2019 01:47:07 +0800

On 2019/8/24 1:17 AM, Song Liu wrote:
>
>
>> On Aug 23, 2019, at 10:03 AM, Coly Li wrote:
>>
>> On 2019/8/24 12:37 AM, Song Liu wrote:
>>> Thanks Coly and Neil.
>>>
>>>> On Aug 22, 2019, at 5:02 PM, NeilBrown wrote:
>>>>
>>>> On Thu, Aug 22 2019, Coly Li wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> First of all: this bug only affects md raid0 devices to which all of
>>>>> the following conditions apply,
>>>>> 1) Assembled from component disks with different sizes.
>>>>> 2) Created and used under a Linux kernel before (and including)
>>>>>    Linux v3.12, then upgraded to a kernel after (and including)
>>>>>    Linux v3.13.
>>>>> 3) New data are written to the md raid0 under the new kernel
>>>>>    (>= Linux v3.13).
>>>>> Such an md raid0 may end up with an inconsistent sector mapping and
>>>>> experience data corruption.
>>>>>
>>>>> Recently I received a bug report that a customer encountered file
>>>>> system corruption after upgrading their kernel from Linux 3.12 to
>>>>> 4.4. It turned out to be corruption of the underlying md raid0
>>>>> after the kernel upgrade.
>>>>>
>>>>> I found it is caused by a sector map bug in the md raid0 code in and
>>>>> before Linux v3.12. Here is the buggy code piece, copied from stable
>>>>> Linux v3.12.74 drivers/md/raid0.c:raid0_make_request(),
>>>>>
>>>>> 547         sector_offset = bio->bi_sector;
>>>>> 548         zone = find_zone(mddev->private, &sector_offset);
>>>>> 549         tmp_dev = map_sector(mddev, zone, bio->bi_sector,
>>>>> 550                              &sector_offset);
>>>>
>>>> I don't think this code is buggy.  The mapping may not be the mapping
>>>> you would expect, but it is the mapping that md/raid0 had always used
>>>> up to this time.
>>>>
>>>>> At line 548, after find_zone() returns, sector_offset has been
>>>>> updated to be an offset inside the current zone. Then at line 549
>>>>> the third parameter of the map_sector() call should be the updated
>>>>> sector_offset, but bio->bi_sector (the original LBA of the md raid0
>>>>> device) is used instead. If the raid0 device has *multiple zones*,
>>>>> then for every zone except the first one, the <dev, sector> pair
>>>>> returned by map_sector() is unexpected and wrong.
>>>>>
>>>>> The buggy code was introduced in Linux v2.6.31 by commit fbb704efb784
>>>>> ("md: raid0 :Enables chunk size other than powers of 2.").
>>>>> Unfortunately the mistaken mapping calculation also produces a stable
>>>>> and unique result, so it worked without obvious problems until commit
>>>>> 20d0189b1012 ("block: Introduce new bio_split()") was merged into
>>>>> Linux v3.13.
>>>>>
>>>>> That patch fixed the mistaken mapping with the following lines of
>>>>> change,
>>>>> 654 -       sector_offset = bio->bi_iter.bi_sector;
>>>>> 655 -       zone = find_zone(mddev->private, &sector_offset);
>>>>> 656 -       tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector,
>>>>> 657 -                            &sector_offset);
>>>>>
>>>>> 694 +       zone = find_zone(mddev->private, &sector);
>>>>> 695 +       tmp_dev = map_sector(mddev, zone, sector, &sector);
>>>>> At line 695 of this patch, the third parameter of the map_sector()
>>>>> call is fixed to 'sector', which is the correct value containing the
>>>>> sector offset inside the corresponding zone.
>>>>
>>>> This is buggy because, as you say, the third argument to map_sector
>>>> has changed.
>>>> Previously it was bio->bi_iter.bi_sector.  Now it is 'sector' which
>>>> find_zone has just modified.
>>>>
>>>>> So this patch implicitly *changes* the md raid0 on-disk layout. If an
>>>>> md raid0 has component disks with *different* sizes, it contains
>>>>> multiple zones. If such a multi-zone raid0 device was created before
>>>>> Linux v3.13, all data chunks after the first zone will be mapped to
>>>>> different locations by kernels after (and including) Linux v3.13. The
>>>>> result is that data written to LBAs beyond the first zone will appear
>>>>> corrupted. A worse case: if the md raid0 has data chunks filled in
>>>>> the first raid0 zone under Linux v3.12 (or an earlier kernel), and is
>>>>> then upgraded to Linux v3.13 (or a later kernel) and more data chunks
>>>>> are filled into the second and later zones, then under neither Linux
>>>>> v3.12 nor Linux v3.13 is the data fully consistent; part of it is
>>>>> always corrupted.
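(To make the layout divergence concrete, here is a minimal userspace sketch.
The three-disk geometry, chunk size and helper names below are assumptions
made up for illustration; they mimic the shape of find_zone()/map_sector()
but are not the kernel code. The only difference between the two calls in
main() is whether the raw bi_sector or the zone-relative offset is passed as
the third map_sector() argument.)

/* raid0-map-sketch.c: illustration only, not the kernel implementation. */
#include <stdio.h>

#define CHUNK 8ULL                     /* sectors per chunk (assumption) */

struct zone {
        unsigned long long zone_end;   /* array LBA (sectors) where the zone ends */
        unsigned long long dev_start;  /* starting sector inside each member disk */
        int nb_dev;                    /* member disks striped in this zone */
};

/* 3 disks of 33, 64 and 64 chunks: zone 0 = 3 x 33 chunks, zone 1 = 2 x 31 chunks */
static const struct zone zones[] = {
        { .zone_end =  99 * CHUNK, .dev_start = 0,          .nb_dev = 3 },
        { .zone_end = 161 * CHUNK, .dev_start = 33 * CHUNK, .nb_dev = 2 },
};

/* shaped like find_zone(): pick the zone, rewrite *sec to the zone-relative offset */
static int find_zone(unsigned long long *sec)
{
        for (int i = 0; i < 2; i++)
                if (*sec < zones[i].zone_end) {
                        if (i)
                                *sec -= zones[i - 1].zone_end;
                        return i;
                }
        return -1;
}

/*
 * shaped like map_sector(): 'sec' selects the member disk and the offset
 * inside the chunk, 'zone_off' selects the chunk number inside the zone.
 */
static void map_sector(int zi, unsigned long long sec,
                       unsigned long long zone_off,
                       int *dev, unsigned long long *dev_sec)
{
        const struct zone *z = &zones[zi];
        unsigned long long in_chunk = sec % CHUNK;
        unsigned long long chunk = zone_off / (CHUNK * z->nb_dev);

        *dev = (int)((sec / CHUNK) % z->nb_dev);
        *dev_sec = z->dev_start + chunk * CHUNK + in_chunk;
}

int main(void)
{
        unsigned long long lba = 120 * CHUNK + 3;   /* some sector in zone 1 */
        unsigned long long off = lba;
        int zi = find_zone(&off);                   /* off is now zone-relative */
        int dev_old, dev_new;
        unsigned long long sec_old, sec_new;

        map_sector(zi, lba, off, &dev_old, &sec_old);  /* <= v3.12: raw bi_sector */
        map_sector(zi, off, off, &dev_new, &sec_new);  /* >= v3.13: zone offset   */

        printf("old mapping: zone %d, member %d, device sector %llu\n",
               zi, dev_old, sec_old);
        printf("new mapping: zone %d, member %d, device sector %llu\n",
               zi, dev_new, sec_new);
        return 0;
}

(On this made-up geometry both calls agree on the per-member offset, because
zone boundaries are chunk aligned, but they disagree on which member disk a
chunk in the second zone lands on; that is exactly the silent on-disk layout
change described above.)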
>>>>> Currently there is no way to tell whether an md raid0 device is
>>>>> mapped with the wrong calculation of kernels before (and including)
>>>>> Linux v3.12, or with the correct calculation of kernels after (and
>>>>> including) Linux v3.13. If an md raid0 device (containing multiple
>>>>> zones) was created and used across these kernel versions, the
>>>>> different mapping calculations may have generated an inconsistent
>>>>> on-disk layout in different raid0 zones, resulting in data
>>>>> corruption.
>>>>>
>>>>> For our enterprise Linux products we can handle it properly for a
>>>>> small number of product kernels. But for upstream and stable
>>>>> kernels, I have no idea how to fix this ugly problem in a generic
>>>>> way.
>>>>>
>>>>> Neil Brown discussed this with me offline. He proposed a temporary
>>>>> workaround that only permits assembling md raid0 devices with
>>>>> identical component disk sizes, and rejects assembling md raid0
>>>>> devices whose component disks have different sizes. We can drop this
>>>>> workaround once there is a proper way to fix the problem.
>>>>>
>>>>> I suggest our developer community work together on a solution; this
>>>>> is the motivation for posting this email for your comments.
>>>>
>>>> There are four separate cases that we need to consider:
>>>> - v1.x metadata
>>>> - v0.90 metadata
>>>> - LVM metadata (via dm-raid)
>>>> - no metadata (array created with "mdadm --build").
>>>>
>>>> For v1.x metadata, I think we can add a new "feature_map" flag.
>>>> If this flag isn't set, raid0 with non-uniform device sizes will not
>>>> be assembled.
>>>> If it is set, then:
>>>>   if 'layout' is 0, use the old mapping
>>>>   if 'layout' is 1, use the new mapping
>>>>
>>>> For v0.90 metadata we don't have feature flags. The gvalid_words
>>>> field is unused and always set to zero, so we could start storing
>>>> some feature bits there.
>>>>
>>>> For LVM/dm-raid, I suspect it doesn't support varying device sizes,
>>>> but we would need to check.
>>>>
>>>> For "no metadata" arrays ... we could possibly just stop supporting
>>>> them - I doubt they are used much.
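(Sketching the v1.x part of this proposal may make it concrete. Everything
below is hypothetical: the feature bit value, field and function names are
made up for illustration and are not an existing superblock or kernel
interface; the code only encodes the decision rule being proposed.)

/* raid0-layout-policy-sketch.c: hypothetical names, illustration only. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define MD_FEATURE_RAID0_LAYOUT_SKETCH  (1u << 9)   /* made-up feature_map bit */

enum raid0_layout_sketch {
        RAID0_LAYOUT_ORIGINAL  = 0,     /* mapping used before v3.13 */
        RAID0_LAYOUT_ALTERNATE = 1,     /* mapping used since v3.13  */
};

struct sb_sketch {                      /* only the fields that matter here */
        unsigned int feature_map;
        unsigned int layout;
        bool uniform_dev_sizes;         /* all member disks the same size? */
};

/* Decide whether to assemble a raid0, and with which mapping, per the proposal. */
static int raid0_pick_layout_sketch(const struct sb_sketch *sb,
                                    enum raid0_layout_sketch *layout)
{
        if (sb->uniform_dev_sizes) {
                /* single zone: old and new mappings are identical, nothing to record */
                *layout = RAID0_LAYOUT_ALTERNATE;
                return 0;
        }
        if (!(sb->feature_map & MD_FEATURE_RAID0_LAYOUT_SKETCH))
                return -EINVAL;         /* multi-zone array of unknown vintage: refuse */

        *layout = sb->layout ? RAID0_LAYOUT_ALTERNATE : RAID0_LAYOUT_ORIGINAL;
        return 0;
}

int main(void)
{
        struct sb_sketch legacy = { .feature_map = 0, .layout = 0,
                                    .uniform_dev_sizes = false };
        enum raid0_layout_sketch layout;

        if (raid0_pick_layout_sketch(&legacy, &layout))
                printf("multi-zone array without the layout feature bit: refuse\n");
        return 0;
}

(A uniformly sized array has a single zone, so both mappings coincide and
nothing needs to be recorded; only multi-zone arrays need the new bit.)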
>>> So for an existing array, we really cannot tell whether it is broken
>>> or not, right? If this is the case, we only need to worry about new
>>> arrays.
>>>
>>> For new arrays, I guess we can only allow v1.x raid0 to have
>>> non-uniform device sizes, and use the new feature_map bit.
>>>
>>> Would this work? If so, we only have 1 case to work on.
>>
>> It seems v1.2 support started since Linux v2.16, so it may also have
>> the problem with multiple zones.
>>
>
> For v1.2 metadata, we still need the feature_map bit, meaning this
> non-uniform array is safe to assemble. If the array doesn't have this
> bit and has non-uniform device sizes, we refuse to assemble it.

Yes, it makes sense. I understand you now :-)

-- 

Coly Li