{"id":215,"date":"2016-08-27T07:48:57","date_gmt":"2016-08-27T06:48:57","guid":{"rendered":"https:\/\/tollana.d-tor.org\/notes-to-self\/?p=215"},"modified":"2017-08-16T12:52:16","modified_gmt":"2017-08-16T11:52:16","slug":"dusting-off-the-array-part-2","status":"publish","type":"post","link":"https:\/\/tollana.d-tor.org\/notes-to-self\/?p=215","title":{"rendered":"Dusting off the Array! (Part 2)"},"content":{"rendered":"<p>Craptastic^2! Another drive failed as of Thursday morning\u00a0during backup (2016\/08\/25). The box hung hard, the SATA bus was completely b0rked, so the process list was filling up with defunct smartctl commands, driving the load towards 100&#8230;<\/p>\n<p>OK, no problem, one hard reset later the array was rebuilding. So far, so good, but during the next backup the array failed again, which was kinda expected. In hindsight I should have disabled the job, though. Anyway, Friday morning the box was locked up hard again. Poweroff hung at unmounting the array, no progress at all, so I just turned it off.<\/p>\n<p>Friday afternoon I replaced the failed disk, booted up and was in deep shit! mdadm told me that it cannot start a dirty degraded array. FUCK! There goes my data, I thought&#8230; But Google came to the rescue!<\/p>\n<p>Fortunately mdadm allows you to force-assemble a dirty, degraded array with:<\/p>\n<pre># mdadm --assemble --force \/dev\/md1 \/dev\/sd[ghj] missing<\/pre>\n<p>Or so I thought. That command exited with an I\/O error, because the drives were busy for some reason.<\/p>\n<pre># cat \/sys\/block\/md1\/md\/array_state\r\ninactive<\/pre>\n<p>As it turned out, inactive is kinda still active. You have to stop the array first to get it working again:<\/p>\n<pre># mdadm -S \/dev\/md1<\/pre>\n<p>Only then can it be force-assembled with the aforementioned command. Once it&#8217;s up and running (degraded), add the new disk:<\/p>\n<pre># mdadm --manage --add \/dev\/md1 \/dev\/sdi<\/pre>\n<p>Now it should be rebuilding. 
Cross your fingers and pray to whatever god you worship \ud83d\ude42 Of course the array was shut down Saturday morning, because I still hadn&#8217;t disabled the backup job, but this time it shut down cleanly. One reboot later the rebuild continued&#8230;<\/p>\n<p>I guess I was very, very, very lucky: As far as I can tell there was\u00a0mostly\u00a0read access up to the 2nd failure (backup). The file systems (all XFS) mounted after recovering from the transaction logs, and the data seems to be OK, but I&#8217;ll see&#8230;<\/p>\n<h3>Lessons learned<\/h3>\n<ul>\n<li>Always shut down the array <strong>cleanly<\/strong> at the first sign of trouble! Don&#8217;t wait until the drive fails completely!<\/li>\n<li>Don&#8217;t think that the failing drive will recover during rebuild. It won&#8217;t! It&#8217;ll only make things worse.<\/li>\n<li>SEAGATE Barracuda drives, esp. ST3000DM001, are, to put it mildly, crap! I didn&#8217;t keep track of the history, but I think I replaced each of them <strong>at least<\/strong> once. So\u00a0I\u00a0ordered an\u00a0HGST 0S03665 Deskstar NAS 4TB 6Gb\/s SATA as replacement instead of the cheaper (and smaller) SEAGATE drive. Let&#8217;s see how that turns out&#8230;<\/li>\n<li>An inactive array can still be busy, i.e. active, and has to be stopped before you can force anything&#8230;<\/li>\n<li>Keep an up-to-date\u00a0<a href=\"https:\/\/tollana.d-tor.org\/notes-to-self\/?p=21\">list of drives<\/a>, their serials and position in the external SATA casing, so you don&#8217;t have to guess which drive failed!<\/li>\n<\/ul>\n<p><strong>Update (2016\/08\/27 5:23pm):<\/strong>\u00a0Fuck SEAGATE! Once again a supposedly new drive almost failed me! 
At 99.9% rebuild the array shut down and I had to reboot, due to:<\/p>\n<pre>Aug 27 16:43:50 hadante kernel: ata5.02: exception Emask 0x100 SAct 0x7fffbfff SErr 0x0 action 0x6 frozen \r\nAug 27 16:43:50 hadante kernel: ata5.02: failed command: WRITE FPDMA QUEUED \r\nAug 27 16:43:50 hadante kernel: ata5.02: cmd 61\/40:00:a0:9b:71\/05:00:5c:01:00\/40 tag 0 ncq 688128 out \r\n \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0res 40\/00:ff:00:00:00\/00:00:00:00:00\/40 Emask 0x4 (timeout) \r\nAug 27 16:43:50 hadante kernel: ata5.02: status: { DRDY }<\/pre>\n<p>After the reboot, the array rebuilt successfully, though. I&#8217;ll replace the failing (new) drive with the HITACHI when it arrives, and if that works, I&#8217;ll replace all drives, I think&#8230;<\/p>\n<p><a href=\"https:\/\/tollana.d-tor.org\/notes-to-self\/?p=196\">Part 1<\/a><br \/>\n<a href=\"https:\/\/tollana.d-tor.org\/notes-to-self\/?p=224\">Part 3<br \/>\n<\/a><a href=\"https:\/\/tollana.d-tor.org\/notes-to-self\/?p=337\">Part 4<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Craptastic^2! Another drive failed as of Thursday morning\u00a0during backup (2016\/08\/25). The box hung hard, the SATA bus was completely b0rked, so the process list was filling up with defunct smartctl commands, driving the load towards 100&#8230; OK, no problem, one hard reset later the array was rebuilding. So far, so good, but during the next &hellip; <a href=\"https:\/\/tollana.d-tor.org\/notes-to-self\/?p=215\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Dusting off the Array! 
(Part 2)<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[116,79,77],"tags":[58,11],"class_list":["post-215","post","type-post","status-publish","format-standard","hentry","category-dusting-off-the-array","category-hardware","category-linux","tag-mdadm","tag-raid"],"_links":{"self":[{"href":"https:\/\/tollana.d-tor.org\/notes-to-self\/index.php?rest_route=\/wp\/v2\/posts\/215","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tollana.d-tor.org\/notes-to-self\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tollana.d-tor.org\/notes-to-self\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tollana.d-tor.org\/notes-to-self\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tollana.d-tor.org\/notes-to-self\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=215"}],"version-history":[{"count":6,"href":"https:\/\/tollana.d-tor.org\/notes-to-self\/index.php?rest_route=\/wp\/v2\/posts\/215\/revisions"}],"predecessor-version":[{"id":343,"href":"https:\/\/tollana.d-tor.org\/notes-to-self\/index.php?rest_route=\/wp\/v2\/posts\/215\/revisions\/343"}],"wp:attachment":[{"href":"https:\/\/tollana.d-tor.org\/notes-to-self\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=215"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tollana.d-tor.org\/notes-to-self\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=215"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tollana.d-tor.org\/notes-to-self\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=215"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}