From mboxrd@z Thu Jan 1 00:00:00 1970
From: Phil Turmel
Subject: Re: broken raid level 5 array caused by user error
Date: Tue, 19 Jan 2016 12:51:43 -0500
Message-ID: <569E77AF.6040906@turmel.org>
References: <3c05d813e42324cdf95989784f6d7b17@pingofdeath.de>
 <56426499.8000205@turmel.org> <564284F5.9080409@turmel.org>
 <56429326.5030405@turmel.org>
 <44d28ec402622b25c4d4d7a32a8888d9@pingofdeath.de>
 <569D384F.6070208@turmel.org>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------080302030603020107030403"
Return-path:
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: Mathias Mueller
Cc: Linux raid, linux-raid-owner@vger.kernel.org
List-Id: linux-raid.ids

This is a multi-part message in MIME format.
--------------080302030603020107030403
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Hi Mathias,

On 01/19/2016 09:35 AM, Mathias Mueller wrote:
> Hi Phil,
>
> I forgot to add some information: when I was creating the bytestrings
> from my jpg file, I did not start at 0k but at 100k of the jpg file
> (to skip the jpg header).

Ok.  But I'm still not confident of the chunk boundaries.

>> Very interesting.  You could go one step further and compare the jpeg
>> file contents in the first 1M against the locations found to determine
>> where the chunks actually start and end on each device.  The final
>> offset will be a chunk multiple before these boundaries.  Or do md5
>> sums of 4k blocks to reduce the amount to inspect.
>
> How exactly can I do this?  Should I create more bytestrings and do
> more bgrep searches with them on my physical devices?  I already have
> results from searching bytestrings with an offset of 64k (from 100k to
> 612k of my jpeg file, so 9 bytestrings in all).  Should I provide a
> table of the results?

Sigh.  I couldn't help myself.  New utility attached.  Curse you,
Mathias, for an interesting problem! ;-)

Call it with your jpeg and the devices to search, like so:

findHash.py /path/to/picture.jpeg /dev/sd[bcde]

It'll make a map of hashes of each 4k block in the jpeg, then search the
listed devices for those hashes, building a map of the file fragments.
This will clearly show the chunk boundaries.
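To make the "final offset" point above concrete: once a chunk boundary
shows up on a device, the possible data offsets are just that boundary
minus whole chunks.  A quick untested sketch of the arithmetic; the
boundary and chunk size below are made-up examples, not your numbers:

#! /usr/bin/python2
# Untested sketch: list the data offsets consistent with one observed
# chunk boundary.  Both inputs are made-up examples.
boundary = 0x2345670000     # device offset where a chunk was seen to start
chunk = 512 * 1024          # candidate chunk size in bytes
offset = boundary % chunk   # lowest candidate, a chunk multiple below it
while offset <= 64*1024*1024:
    print "candidate data offset: %d bytes (%d sectors)" % (offset, offset/512)
    offset += chunk

If the chunk size guess is right, the boundaries on all of your devices
should agree on the same data offset.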
Please show the output.

Phil

--------------080302030603020107030403
Content-Type: text/x-python;
 name="findHash.py"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="findHash.py"

#! /usr/bin/python2
#
# Locate 4k fragments of a subject file in one or more other files or
# devices.  Only reports two or more consecutive matches.
#
# Usage:
#   findHash.py /path/to/subject/file /dev/sdx|/path/to/image/file [/dev/sdy ...]

import hashlib, sys, datetime

# Read the known file 4k at a time, building a dictionary of
# md5 hashes vs. offset.  Use a large buffer for speed.
# Drops any partial block at the end of the file.
d = {}
pos = long(0)
f = open(sys.argv[1], 'rb', 1<<20)
b = f.read(4096)
while len(b)==4096:
    md5 = hashlib.md5()
    md5.update(b)
    h = md5.digest()
    hlist = d.get(h)
    if not hlist:
        hlist = []
        d[h] = hlist
#        print "New hash %s at %8.8x" % (h.encode('hex'), pos)
    hlist.append(pos)
    pos += 4096
    b = f.read(4096)
f.close()
print "%d Unique hashes in %s" % (len(d), sys.argv[1])

def checkAndPrint(match):
    # Report only runs of two or more consecutive 4k blocks.
    if match[2]>4096:
        print "%20s @ %12.12x:%12.12x ~= %8.8x:%8.8x" % (fname,
            match[1], match[1]+match[2]-1, match[0], match[0]+match[2]-1)

# Read the candidate files/devices, looking for possible matches.  Match
# entries are vectors of known file offset, candidate file offset, and
# length.
for fname in sys.argv[2:]:
    print "\nSearching for pieces of %s in %s:..." % (sys.argv[1], fname)
    pos = long(0)
    f = open(fname, 'rb', 1<<24)
    matches = []
    b = f.read(4096)
    lastts = None
    while len(b)==4096:
        # Progress and throughput report every 128MB.
        if not (pos & 0x7ffffff):
            ts = datetime.datetime.now()
            if lastts:
                print "@ %12.12x %.1fMB/s \r" % (pos,
                    128.0/((ts-lastts).total_seconds())),
            else:
                print "@ %12.12x...\r" % pos,
            sys.stdout.flush()
            lastts = ts
        md5 = hashlib.md5()
        md5.update(b)
        h = md5.digest()
        if h in d:
            # Extend each open match whose next candidate offset is
            # here and whose next subject offset carries this hash;
            # report and drop the rest.
            i = 0
            while i < len(matches):
                match = matches[i]
                if match[1]+match[2] == pos and match[0]+match[2] in d[h]:
                    match[2] += 4096
                    i += 1
                else:
                    checkAndPrint(match)
                    del matches[i]
            # Start a new match at every other subject offset that
            # carries this hash.
            extended = [m[0]+m[2]-4096 for m in matches]
            for offset in d[h]:
                if offset not in extended:
                    matches.append([offset, pos, 4096])
        else:
            # Unknown block: report and drop all open matches.
            for match in matches:
                checkAndPrint(match)
            matches = []
        pos += 4096
        b = f.read(4096)
    # End of this candidate: report anything still open.
    for match in matches:
        checkAndPrint(match)
    f.close()
--------------080302030603020107030403--