dm-crypt.saout.de archive mirror
* [dm-crypt] Help request
@ 2020-07-06  3:36 lacedaemonius
  2020-07-07 20:58 ` Michael Kjörling
  0 siblings, 1 reply; 4+ messages in thread
From: lacedaemonius @ 2020-07-06  3:36 UTC (permalink / raw)
  To: dm-crypt

I have what to my surprise appears to be an uncommon situation and must humbly ask for assistance. I say uncommon because it's not covered by the FAQ and a web search hasn't turned up any results. The issue is that I have an ext4 partition on a LUKS-encrypted drive, and the computer it was mounted on restarted without cleanly unmounting and closing the drive. Now if I try to mount the drive, bash hangs for several minutes and then claims that it's not a valid LUKS device. Pulling dmesg shows a number of I/O errors. Here's a sample:

[ 643.631782] print_req_error: critical target error, dev sdi, sector 11721044993
[ 643.631789] Buffer I/O error on dev sdi, logical block 11721044993, async page read
[ 649.107468] sd 8:0:0:0: [sdi] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 649.107473] sd 8:0:0:0: [sdi] tag#24 Sense Key : Hardware Error [current]
[ 649.107479] sd 8:0:0:0: [sdi] tag#24 ASC=0x44 <<vendor>>ASCQ=0x81
[ 649.107484] sd 8:0:0:0: [sdi] tag#24 CDB: Read(16) 88 00 00 00 00 02 ba a0 f4 02 00 00 00 01 00 00
[ 649.107487] print_req_error: critical target error, dev sdi, sector 11721044994
[ 649.107496] Buffer I/O error on dev sdi, logical block 11721044994, async page read
[ 654.583148] sd 8:0:0:0: [sdi] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 654.583152] sd 8:0:0:0: [sdi] tag#25 Sense Key : Hardware Error [current]
[ 654.583157] sd 8:0:0:0: [sdi] tag#25 ASC=0x44 <<vendor>>ASCQ=0x81
[ 654.583160] sd 8:0:0:0: [sdi] tag#25 CDB: Read(16) 88 00 00 00 00 02 ba a0 f4 03 00 00 00 01 00 00
[ 654.583163] print_req_error: critical target error, dev sdi, sector 11721044995
[ 654.586033] Buffer I/O error on dev sdi, logical block 11721044995, async page read
[ 660.058859] sd 8:0:0:0: [sdi] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 660.058863] sd 8:0:0:0: [sdi] tag#26 Sense Key : Hardware Error [current]
[ 660.058869] sd 8:0:0:0: [sdi] tag#26 ASC=0x44 <<vendor>>ASCQ=0x81
[ 660.058874] sd 8:0:0:0: [sdi] tag#26 CDB: Read(16) 88 00 00 00 00 02 ba a0 f4 04 00 00 00 01 00 00
[ 660.058877] print_req_error: critical target error, dev sdi, sector 11721044996
[ 660.062377] Buffer I/O error on dev sdi, logical block 11721044996, async page read
[ 662.783221] sd 8:0:0:0: [sdi] tag#28 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD IN
[ 662.783227] sd 8:0:0:0: [sdi] tag#28 CDB: Read(16) 88 00 00 00 00 02 ba a0 f4 06 00 00 00 01 00 00
[ 662.783327] sd 8:0:0:0: [sdi] tag#27 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD IN
[ 662.783332] sd 8:0:0:0: [sdi] tag#27 CDB: Read(16) 88 00 00 00 00 02 ba a0 f4 05 00 00 00 01 00 00
[ 662.783428] sd 8:0:0:0: [sdi] tag#0 uas_eh_abort_handler 0 uas-tag 8 inflight: CMD IN
[ 662.783434] sd 8:0:0:0: [sdi] tag#0 CDB: Read(16) 88 00 00 00 00 02 ba a0 f4 07 00 00 00 01 00 00

I don't think it's a drive failure because it's only a few months old and I haven't got any SMART warnings, so that leaves software. Is it worth making any attempt at trying to recover the drive and if so is there any documentation that explains what to do? I don't have a backup of the LUKS header, if that's the problem.


* Re: [dm-crypt] Help request
  2020-07-06  3:36 [dm-crypt] Help request lacedaemonius
@ 2020-07-07 20:58 ` Michael Kjörling
  2020-07-07 23:43   ` Arno Wagner
  2020-07-08  0:43   ` Robert Nichols
  0 siblings, 2 replies; 4+ messages in thread
From: Michael Kjörling @ 2020-07-07 20:58 UTC (permalink / raw)
  To: dm-crypt

On 6 Jul 2020 03:36 +0000, from lacedaemonius1@protonmail.com (lacedaemonius):
> [ 643.631782] print_req_error: critical target error, dev sdi, sector 11721044993
> [ 643.631789] Buffer I/O error on dev sdi, logical block 11721044993, async page read

Notice that the errors are occurring on the raw device, not through a
dm-* mapping. That sector address is just past the 6 TB (about 5.46
TiB) mark; does that sound reasonable given the drive size? (It would
if the physical drive is _more_ than 6 TB in size, and it might if the
drive is advertised as 6 TB.) Assuming that the problematic drive is
still detected as sdi, what's the contents of /sys/block/sdi/size?
(That should be _at least_ 11721044993; otherwise, some metadata
somewhere has been corrupted.)
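
For example (note that the size file counts 512-byte sectors, so
11721044993 × 512 bytes is roughly 6.0 TB, or about 5.46 TiB):

 cat /sys/block/sdi/size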

If you luksOpen the LUKS container and "file -Ls" the corresponding
file in /dev/mapper, then what is the output of that? It should
indicate an ext4 file system in your case.
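
Something along these lines, assuming the container sits on a
partition such as /dev/sdi1 (adjust to your layout; "rescue" is just
a placeholder mapping name):

 cryptsetup luksOpen /dev/sdi1 rescue
 file -Ls /dev/mapper/rescue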

If that too fails, then I would suggest a pass of ddrescue reading
from the raw backing device and writing to /dev/null. (If you do this,
make VERY VERY SURE that you get the order right!) That will tell you
whether the data on the drive itself can be read without errors. If
you have enough storage elsewhere to make a copy of the whole contents
of the drive, strongly consider writing it there instead of throwing
it away; it can't hurt, and it might help. If you do this, expect it
to take the better part of a day to complete. (6 TB at 100 MB/s is
16-17 hours; you haven't specified the drive size, and 100 MB/s is a
reasonable average for a 7200 rpm rotational drive.)
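
A sketch of both variants (the map file lets ddrescue resume after an
interruption; the names and paths are placeholders, and do
triple-check the argument order before running):

 # read-only surface check, discarding the data
 ddrescue -f /dev/sdi /dev/null sdi.map
 # or instead: rescue to an image on a larger disk
 ddrescue /dev/sdi /mnt/bigdisk/sdi.img sdi.map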

That you're seeing delays of several seconds for those reads, and
user-visible delays of more than that, suggests to me that this is
not just an out-of-bounds read command issued to the drive. Such a
command should return more or less immediately with something like
"sector not found", which in turn would be propagated as an I/O
error.

Is the LUKS container LUKS 1 or LUKS 2? Is the drive GPT partitioned,
or something else?
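
If the start of the drive still reads, something like this should
answer both questions (the partition is again a placeholder):

 cryptsetup luksDump /dev/sdi1   # the "Version:" line says 1 or 2
 parted /dev/sdi print           # "Partition Table:" says gpt, msdos, ...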


> I don't think it's a drive failure because it's only a few months
> old and I haven't got any SMART warnings, so that leaves software.

Unfortunately, drives can fail without reporting failures in SMART
data, and they can fail early. While the probability of either is
_lower_, it is non-zero.

An in-use drive failing certainly can cause issues to the running
system. A drive failing but not holding swap or a critical file system
_shouldn't_ cause the kernel to crash, but I wouldn't completely rule
out the possibility.

The fact that the LUKS container was not closed _should_ not cause any
issues after a reboot, because closing the container really just
removes bookkeeping information and cryptographic keys from kernel
memory; it doesn't affect on-disk data. An unclean shutdown isn't
ideal for ext4, but it's usually not catastrophic.
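
Once the container opens, a read-only check that makes no repairs
would show how bad the filesystem state actually is ("rescue" again
being a placeholder mapping name):

 fsck.ext4 -n /dev/mapper/rescue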

> Is it worth making any attempt at trying to recover the drive and if
> so is there any documentation that explains what to do? I don't have
> a backup of the LUKS header, if that's the problem.

Do you have a recent backup of the data on the drive, or does the
drive that is giving you problems hold the only copy? Is it data that
you care a lot about, or can it be easily restored from other sources?
(This basically boils down to: how important is it to rescue the data
in-place?)

-- 
Michael Kjörling • https://michael.kjorling.se • michael@kjorling.se
 “Remember when, on the Internet, nobody cared that you were a dog?”

* Re: [dm-crypt] Help request
  2020-07-07 20:58 ` Michael Kjörling
@ 2020-07-07 23:43   ` Arno Wagner
  2020-07-08  0:43   ` Robert Nichols
  1 sibling, 0 replies; 4+ messages in thread
From: Arno Wagner @ 2020-07-07 23:43 UTC (permalink / raw)
  To: dm-crypt

On Tue, Jul 07, 2020 at 22:58:14 CEST, Michael Kjörling wrote:
> On 6 Jul 2020 03:36 +0000, from lacedaemonius1@protonmail.com (lacedaemonius):
[...]
> > I don't think it's a drive failure because it's only a few months
> > old and I haven't got any SMART warnings, so that leaves software.
> 
> Unfortunately, drives can fail without reporting failures in SMART
> data, and they can fail early. While the probability of either is
> _lower_, it is non-zero.

There are some numbers from IBM that say that drives fail about 50%
of the time without a failed SMART status. These are old numbers, and
I am not sure how much they apply these days. It is also a good idea
to monitor SMART attributes directly and not only check for a failed
status.

One thing to do is to run a long SMART selftest (basically a more
sophisticated variant of scanning the surface yourself). It does
read the whole surface to check for read errors:

 smartctl -t long <drive>

Depending on drive size, this can take a long time. It also makes
sense to look at the SMART attributes manually; a drive will get a
failed SMART status only when things are really bad, but you can
often see what is going on before that.
This gives you the attributes:

 smartctl -a <drive>

Post the results here if you are unsure how to interpret them.
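
The outcome of the long selftest also ends up in the drive's
self-test log, which you can read with:

 smartctl -l selftest <drive>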

[...]

> The fact that the LUKS container was not closed _should_ not cause any
> issues after a reboot, because closing the container really just
> removes bookkeeping information and cryptographic keys from kernel
> memory; it doesn't affect on-disk data. 

In fact, the LUKS metadata never gets written in normal operation.
Hence the container is not actually "opened" as far as the on-disk
state is concerned.
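
As an aside, once the header can be read reliably again, making a
backup of it is cheap insurance (the file name is just an example):

 cryptsetup luksHeaderBackup <device> --header-backup-file <file>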

> > Is it worth making any attempt at trying to recover the drive and if
> > so is there any documentation that explains what to do? I don't have
> > a backup of the LUKS header, if that's the problem.
> 
> Do you have a recent backup of the data on the drive, or does the
> drive that is giving you problems hold the only copy? Is it data that
> you care a lot about, or can it be easily restored from other sources?
> (This basically boils down to: how important is it to rescue the data
> in-place?)

If this is broken hardware in the drive (and it looks like it),
recovery will be something you likely cannot do yourself, and it
will be really expensive. It will also fail unless a valid keyslot
gets recovered fully. It may instead be a broken controller or even
some other hardware damage on the system on which you are trying to
read the disk, so it may be worthwhile trying to read it on a
different machine.

The detailed SMART status should tell more though.

Regards,
Arno
-- 
Arno Wagner,     Dr. sc. techn., Dipl. Inform.,    Email: arno@wagner.name
GnuPG: ID: CB5D9718  FP: 12D6 C03B 1B30 33BB 13CF  B774 E35C 5FA1 CB5D 9718
----
A good decision is based on knowledge and not on numbers. -- Plato

If it's in the news, don't worry about it.  The very definition of 
"news" is "something that hardly ever happens." -- Bruce Schneier

* Re: [dm-crypt] Help request
  2020-07-07 20:58 ` Michael Kjörling
  2020-07-07 23:43   ` Arno Wagner
@ 2020-07-08  0:43   ` Robert Nichols
  1 sibling, 0 replies; 4+ messages in thread
From: Robert Nichols @ 2020-07-08  0:43 UTC (permalink / raw)
  To: dm-crypt

On 7/7/20 3:58 PM, Michael Kjörling wrote:
> On 6 Jul 2020 03:36 +0000, from lacedaemonius1@protonmail.com (lacedaemonius):
>> [ 643.631782] print_req_error: critical target error, dev sdi, sector 11721044993
>> [ 643.631789] Buffer I/O error on dev sdi, logical block 11721044993, async page read
...
> Unfortunately, drives can fail without reporting failures in SMART
> data, and they can fail early. While the probability of either is
> _lower_, it is non-zero.

And, a single bad sector is not going to cause SMART to report a failure.
"FAILING NOW" is not reported until the drive has nearly exhausted its
supply of spare sectors.

The good news is that this LBA, right at the end of a 6TB (nominal)
drive, is a fairly unlikely location for a LUKS header, so even if
this is an actual bad sector your data should be otherwise recoverable.

The output from "smartctl -A" (or the fuller report from "-a") should
be relevant, in particular the "Current_Pending_Sector" raw value.
A non-zero value there indicates bad sectors that are visible to the
OS and will cause an I/O error when read.
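
For example (assuming the drive still shows up as sdi; the
Reallocated_Sector_Ct raw value is worth a glance as well):

 smartctl -A /dev/sdi | grep -i -e pending -e reallocated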

-- 
Bob Nichols     "NOSPAM" is really part of my email address.
                 Do NOT delete it.
