From: Roger Heflin
Date: Mon, 5 Oct 2020 10:58:32 -0500
Subject: Re: do i need to give up on this setup
To: Daniel Sanabria
Cc: Roman Mamedov, Linux-RAID
References: <20201005184449.54225175@natsu> <20201005190421.4ecd8f1b@natsu>
X-Mailing-List: linux-raid@vger.kernel.org

As they said, you have a hardware problem. It could be any of the things
previously mentioned, and it could also be the power supply being unable to
provide a stable 12V for the disks.

You should give the list more specifics on your hardware setup. Of particular
interest are what kind of SATA/SAS ports you are using and how the disks are
cabled in. Note that a number of controllers aren't the most reliable, and
some of them, when something goes wrong, will stop responding for all disks
connected to them.
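As a rough sketch of one way to collect the port information asked for above:
on Linux the resolved path of /sys/block/<disk> normally contains the PCI
address of the controller and, for libata-driven ports, an ataN component.
The function names below are made up for illustration, and the path layout is
an assumption that may not hold for USB enclosures or some SAS HBAs.

```python
import os
import re

# PCI function address as it appears in sysfs, e.g. 0000:00:1f.2
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")
# libata port directory, e.g. ata3
ATA_PORT = re.compile(r"ata\d+")


def controller_of(sysfs_path):
    """Return (pci_address, ata_port) parsed from a resolved /sys/block/<disk> path.

    Either element is None when it cannot be found (virtual or USB disks,
    or controllers that are not driven by libata).
    """
    parts = sysfs_path.split("/")
    pci = [p for p in parts if PCI_ADDR.fullmatch(p)]
    ata = [p for p in parts if ATA_PORT.fullmatch(p)]
    # The last PCI address in the path is the controller itself; earlier
    # ones are bridges sitting between it and the root complex.
    return (pci[-1] if pci else None, ata[0] if ata else None)


def list_disks(sys_block="/sys/block"):
    """Map each sd* disk to the controller and port it hangs off."""
    result = {}
    if not os.path.isdir(sys_block):
        return result  # not a Linux system, nothing to report
    for name in sorted(os.listdir(sys_block)):
        if name.startswith("sd"):
            real = os.path.realpath(os.path.join(sys_block, name))
            result[name] = controller_of(real)
    return result


if __name__ == "__main__":
    for disk, (pci, port) in list_disks().items():
        print(f"{disk}: controller {pci}, port {port}")
```

Disks that share a PCI address sit on the same controller, and "lspci -s"
with that address should name the chip. How each disk is cabled and powered
still has to be checked by hand.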
I have also seen badly designed motherboards whose built-in SATA ports
(driven by non-AMD/non-Intel chips) don't work under any load that uses more
than a single disk at a time, and/or act badly when given SMART commands.

On Mon, Oct 5, 2020 at 9:30 AM Daniel Sanabria wrote:
>
> > I meant not to me personally, but to the mailing list. The drives seem OK
> > though, even sde.
>
> Sorry missed the reply-all button
>
> On Mon, 5 Oct 2020 at 15:04, Roman Mamedov wrote:
> >
> > On Mon, 5 Oct 2020 14:59:35 +0100
> > Daniel Sanabria wrote:
> >
> > > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > > "smartctl -a" of all drives as well.
> >
> > I meant not to me personally, but to the mailing list. The drives seem OK
> > though, even sde.
> >
> > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
> > > [sudo] password for dan:
> > > smartctl 6.6 2017-11-05 r4594
> > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > >
> > > === START OF INFORMATION SECTION ===
> > > Model Family: Western Digital Green
> > > Device Model: WDC WD30EZRX-00D8PB0
> > > Serial Number: WD-WCC4NCWT13RF
> > > LU WWN Device Id: 5 0014ee 25fc9e460
> > > Firmware Version: 80.00A80
> > > User Capacity: 3,000,591,900,160 bytes [3.00 TB]
> > > Sector Sizes: 512 bytes logical, 4096 bytes physical
> > > Rotation Rate: 5400 rpm
> > > Device is: In smartctl database [for details use: -P show]
> > > ATA Version is: ACS-2 (minor revision not indicated)
> > > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > Local Time is: Mon Oct 5 14:58:34 2020 BST
> > > SMART support is: Available - device has SMART capability.
> > > SMART support is: Enabled
> > >
> > > === START OF READ SMART DATA SECTION ===
> > > SMART overall-health self-assessment test result: PASSED
> > >
> > > General SMART Values:
> > > Offline data collection status: (0x82) Offline data collection activity
> > > was completed without error.
> > > Auto Offline Data Collection: Enabled.
> > > Self-test execution status: ( 0) The previous self-test routine completed
> > > without error or no self-test has ever
> > > been run.
> > > Total time to complete Offline
> > > data collection: (38940) seconds.
> > > Offline data collection
> > > capabilities: (0x7b) SMART execute Offline immediate.
> > > Auto Offline data collection on/off support.
> > > Suspend Offline collection upon new
> > > command.
> > > Offline surface scan supported.
> > > Self-test supported.
> > > Conveyance Self-test supported.
> > > Selective Self-test supported.
> > > SMART capabilities: (0x0003) Saves SMART data before entering
> > > power-saving mode.
> > > Supports SMART auto save timer.
> > > Error logging capability: (0x01) Error logging supported.
> > > General Purpose Logging supported.
> > > Short self-test routine
> > > recommended polling time: ( 2) minutes.
> > > Extended self-test routine
> > > recommended polling time: ( 391) minutes.
> > > Conveyance self-test routine
> > > recommended polling time: ( 5) minutes.
> > > SCT capabilities: (0x7035) SCT Status supported.
> > > SCT Feature Control supported.
> > > SCT Data Table supported.
> > >
> > > SMART Attributes Data Structure revision number: 16
> > > Vendor Specific SMART Attributes with Thresholds:
> > > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> > > 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
> > > 3 Spin_Up_Time 0x0027 178 165 021 Pre-fail Always - 6075
> > > 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 81
> > > 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
> > > 7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0
> > > 9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 18577
> > > 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> > > 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> > > 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 81
> > > 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 46
> > > 193 Load_Cycle_Count 0x0032 142 142 000 Old_age Always - 176661
> > > 194 Temperature_Celsius 0x0022 122 109 000 Old_age Always - 28
> > > 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
> > > 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
> > > 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
> > > 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
> > > 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
> > >
> > > SMART Error Log Version: 1
> > > No Errors Logged
> > >
> > > SMART Self-test log structure revision number 1
> > > Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> > > # 1 Extended offline Completed without error 00% 17479 -
> > > # 2 Short offline Completed without error 00% 15531 -
> > >
> > > SMART Selective self-test log data structure revision number 1
> > > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> > > 1 0 0 Not_testing
> > > 2 0 0 Not_testing
> > > 3 0 0 Not_testing
> > > 4 0 0 Not_testing
> > > 5 0 0 Not_testing
> > > Selective self-test flags (0x0):
> > > After scanning selected spans, do NOT read-scan remainder of disk.
> > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > >
> > > [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
> > > smartctl 6.6 2017-11-05 r4594
> > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > >
> > > === START OF INFORMATION SECTION ===
> > > Model Family: Western Digital Green
> > > Device Model: WDC WD30EZRX-00D8PB0
> > > Serial Number: WD-WCC4NPRDD6D7
> > > LU WWN Device Id: 5 0014ee 25fca27b1
> > > Firmware Version: 80.00A80
> > > User Capacity: 3,000,592,982,016 bytes [3.00 TB]
> > > Sector Sizes: 512 bytes logical, 4096 bytes physical
> > > Rotation Rate: 5400 rpm
> > > Device is: In smartctl database [for details use: -P show]
> > > ATA Version is: ACS-2 (minor revision not indicated)
> > > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > Local Time is: Mon Oct 5 14:58:54 2020 BST
> > > SMART support is: Available - device has SMART capability.
> > > SMART support is: Enabled
> > >
> > > === START OF READ SMART DATA SECTION ===
> > > SMART overall-health self-assessment test result: PASSED
> > >
> > > General SMART Values:
> > > Offline data collection status: (0x82) Offline data collection activity
> > > was completed without error.
> > > Auto Offline Data Collection: Enabled.
> > > Self-test execution status: ( 0) The previous self-test routine completed
> > > without error or no self-test has ever
> > > been run.
> > > Total time to complete Offline
> > > data collection: (39060) seconds.
> > > Offline data collection
> > > capabilities: (0x7b) SMART execute Offline immediate.
> > > Auto Offline data collection on/off support.
> > > Suspend Offline collection upon new
> > > command.
> > > Offline surface scan supported.
> > > Self-test supported.
> > > Conveyance Self-test supported.
> > > Selective Self-test supported.
> > > SMART capabilities: (0x0003) Saves SMART data before entering
> > > power-saving mode.
> > > Supports SMART auto save timer.
> > > Error logging capability: (0x01) Error logging supported.
> > > General Purpose Logging supported.
> > > Short self-test routine
> > > recommended polling time: ( 2) minutes.
> > > Extended self-test routine
> > > recommended polling time: ( 392) minutes.
> > > Conveyance self-test routine
> > > recommended polling time: ( 5) minutes.
> > > SCT capabilities: (0x7035) SCT Status supported.
> > > SCT Feature Control supported.
> > > SCT Data Table supported.
> > >
> > > SMART Attributes Data Structure revision number: 16
> > > Vendor Specific SMART Attributes with Thresholds:
> > > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> > > 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
> > > 3 Spin_Up_Time 0x0027 178 164 021 Pre-fail Always - 6100
> > > 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 81
> > > 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
> > > 7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0
> > > 9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 18580
> > > 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> > > 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> > > 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 81
> > > 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 53
> > > 193 Load_Cycle_Count 0x0032 136 136 000 Old_age Always - 192427
> > > 194 Temperature_Celsius 0x0022 121 108 000 Old_age Always - 29
> > > 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
> > > 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
> > > 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
> > > 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
> > > 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
> > >
> > > SMART Error Log Version: 1
> > > No Errors Logged
> > >
> > > SMART Self-test log structure revision number 1
> > > Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> > > # 1 Extended offline Completed without error 00% 17481 -
> > > # 2 Short offline Completed without error 00% 15534 -
> > >
> > > SMART Selective self-test log data structure revision number 1
> > > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> > > 1 0 0 Not_testing
> > > 2 0 0 Not_testing
> > > 3 0 0 Not_testing
> > > 4 0 0 Not_testing
> > > 5 0 0 Not_testing
> > > Selective self-test flags (0x0):
> > > After scanning selected spans, do NOT read-scan remainder of disk.
> > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > >
> > > [dan@lamachine ~]$ sudo smartctl -a /dev/sde
> > > smartctl 6.6 2017-11-05 r4594
> > > [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
> > > Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> > >
> > > === START OF INFORMATION SECTION ===
> > > Model Family: Western Digital Green
> > > Device Model: WDC WD30EZRX-00D8PB0
> > > Serial Number: WD-WCC4N1294906
> > > LU WWN Device Id: 5 0014ee 25f968120
> > > Firmware Version: 80.00A80
> > > User Capacity: 3,000,591,900,160 bytes [3.00 TB]
> > > Sector Sizes: 512 bytes logical, 4096 bytes physical
> > > Rotation Rate: 5400 rpm
> > > Device is: In smartctl database [for details use: -P show]
> > > ATA Version is: ACS-2 (minor revision not indicated)
> > > SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> > > Local Time is: Mon Oct 5 14:58:57 2020 BST
> > > SMART support is: Available - device has SMART capability.
> > > SMART support is: Enabled
> > >
> > > === START OF READ SMART DATA SECTION ===
> > > SMART overall-health self-assessment test result: PASSED
> > >
> > > General SMART Values:
> > > Offline data collection status: (0x82) Offline data collection activity
> > > was completed without error.
> > > Auto Offline Data Collection: Enabled.
> > > Self-test execution status: ( 0) The previous self-test routine completed
> > > without error or no self-test has ever
> > > been run.
> > > Total time to complete Offline
> > > data collection: (43200) seconds.
> > > Offline data collection
> > > capabilities: (0x7b) SMART execute Offline immediate.
> > > Auto Offline data collection on/off support.
> > > Suspend Offline collection upon new
> > > command.
> > > Offline surface scan supported.
> > > Self-test supported.
> > > Conveyance Self-test supported.
> > > Selective Self-test supported.
> > > SMART capabilities: (0x0003) Saves SMART data before entering
> > > power-saving mode.
> > > Supports SMART auto save timer.
> > > Error logging capability: (0x01) Error logging supported.
> > > General Purpose Logging supported.
> > > Short self-test routine
> > > recommended polling time: ( 2) minutes.
> > > Extended self-test routine
> > > recommended polling time: ( 433) minutes.
> > > Conveyance self-test routine
> > > recommended polling time: ( 5) minutes.
> > > SCT capabilities: (0x7035) SCT Status supported.
> > > SCT Feature Control supported.
> > > SCT Data Table supported.
> > >
> > > SMART Attributes Data Structure revision number: 16
> > > Vendor Specific SMART Attributes with Thresholds:
> > > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> > > 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
> > > 3 Spin_Up_Time 0x0027 176 166 021 Pre-fail Always - 6158
> > > 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 80
> > > 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
> > > 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
> > > 9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 18465
> > > 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> > > 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> > > 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 80
> > > 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 53
> > > 193 Load_Cycle_Count 0x0032 142 142 000 Old_age Always - 174015
> > > 194 Temperature_Celsius 0x0022 121 107 000 Old_age Always - 29
> > > 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
> > > 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
> > > 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
> > > 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
> > > 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
> > >
> > > SMART Error Log Version: 1
> > > No Errors Logged
> > >
> > > SMART Self-test log structure revision number 1
> > > Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> > > # 1 Extended offline Completed without error 00% 17347 -
> > > # 2 Short offline Completed without error 00% 15414 -
> > >
> > > SMART Selective self-test log data structure revision number 1
> > > SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> > > 1 0 0 Not_testing
> > > 2 0 0 Not_testing
> > > 3 0 0 Not_testing
> > > 4 0 0 Not_testing
> > > 5 0 0 Not_testing
> > > Selective self-test flags (0x0):
> > > After scanning selected spans, do NOT read-scan remainder of disk.
> > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > >
> > > [dan@lamachine ~]$
> > >
> > >
> > > On Mon, 5 Oct 2020 at 14:44, Roman Mamedov wrote:
> > > >
> > > > On Mon, 5 Oct 2020 14:10:25 +0100
> > > > Daniel Sanabria wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Scrubbing ( # echo check >
> > > > > /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
> > > > >
> > > > > I'm attaching details of the array and disks (bloody wd greens) as
> > > > > well as journalctl errors providing some details about the issue.
> > > > >
> > > > > If you have any pointers on what might be the cause of this as well as
> > > > > any recommendations on how to improve things please let me thank you
> > > > > in advance ...
> > > > >
> > > > > I have backups of the data so happy to move this to a different setup
> > > > > you might recommend (apps will be mostly reading from the array via
> > > > > NFS since most of the content will be media).
> > > > >
> > > > > My suspicion is that a timer service is kicking in and disrupting the
> > > > > scrubbing somehow but can't pinpoint what causes this.
> > > >
> > > > It looks like a drive is dropping off the bus and then failing to reidentify,
> > > > could be bad cabling/controller/PSU, or just a bad drive. You should post
> > > > "smartctl -a" of all drives as well.
> > > >
> > > > --
> > > > With respect,
> > > > Roman
> > > >
> > --
> > With respect,
> > Roman
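
For anyone re-checking the quoted smartctl reports, here is a minimal sketch
that pulls the raw values out of an attribute table and lists the ones that
would normally stay at 0 on a healthy drive and link. The WATCHED selection
and the function names are assumptions made for illustration, not anything
smartmontools defines, and the parser expects each attribute row on a single
line, as smartctl prints it. SAMPLE holds a few rows copied from the
/dev/sdc report.

```python
# Attributes whose raw value normally stays at 0 (illustrative selection).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)

# A few rows taken from the /dev/sdc report quoted in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
  9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 18577
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""


def raw_values(smartctl_output):
    """Return {attribute_name: raw_value} for every parsable attribute row."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # An attribute row has at least ten columns: a numeric ID first,
        # a hex flag third, and the raw value in the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # skip vendor-specific raw formats
    return values


def nonzero_watched(smartctl_output):
    """Return the watched attributes whose raw value is not zero."""
    raw = raw_values(smartctl_output)
    return {name: raw[name] for name in WATCHED if raw.get(name, 0) != 0}


if __name__ == "__main__":
    print(nonzero_watched(SAMPLE) or "no watched attribute is non-zero")
```

An empty result only says the drive has not logged media or interface CRC
errors. It does not rule out the power or controller problems discussed at
the top of this mail, since a drive that drops off the bus may never get the
chance to record anything.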