Imaging A Hard Drive With non-ECC Memory
2022-12-30 – By Robert Elder
I recently came home to visit with family, but right now as I write this, it's 1:00am and I'm running memtest86+ by myself in the basement.
The purpose of this article is to provide a real-life case study to show why ECC memory is not a waste of money, and to provide a detailed account of how much time you can waste when bit flips become a part of your daily computing life. Having said that, the previous sentence is a bit of a lie because this article is actually about more than that: Specifically, I'll discuss various learnings that I had in the process of trying to image the hard drive of an old failing family computer, and the process of trying to restore the computer to fully working order. These learnings include:
- Is the process of diagnosing failing RAM very fast and easy? Nope!
- Should you use ‘ddrescue’ or ‘dd’ to image a failing hard drive? The answer depends on whether your memory is failing, and it may surprise you!
- So, you just imaged a drive with ‘dd’, but the hash checksum doesn't match? Pay attention to copy block sizes!
- Does the cheap new DDR3 RAM from eBay or NewEgg work in old motherboards? Nope, but sometimes yes.
- Does increased heat increase the likelihood of memory errors? I think it does.
Speaking of wasting time, I wasted a huge amount of time on the analysis described in this article (4-6 hours per 500GB hard drive md5 checksum and ~12 hours+ per memtest run). This is probably the reason that you don't read too many articles like this one. It's not because bit flips are rare, but rather because most people who encounter them are smart enough to declare “It was probably just a ghost in my program!” and reboot the machine instead of investigating the issue and then wasting even more time writing a blog post to document the issue in excruciating detail.
My overall goal was to back up the data from an old family computer that was experiencing some ‘problems’. This was actually a computer that I personally put together back in the year 2008. This would make it (2022 - 2008) = 14 years old! I found it fairly impressive that it was still running, despite the reports of occasional unexplained reboots.
Backing up an old computer... Sounds like a simple task, right? Well, allow me to make it more complicated for you:
Most people would probably consider the simplest backup approach of copying parts of the filesystem to an external flash drive to be enough, but in my case I wanted to approach things from a thorough data recovery perspective. Therefore, I decided to image the entire hard disk to guarantee a perfect preservation of all data. This computer was old, so I figured it would be reasonable to assume that the hard drive might already be starting to fail.
I decided to use ‘ddrescue’ to image the drive since I know that ‘ddrescue’ maintains awareness of disk read errors. This is useful because it allows you to check whether there are indeed problems reading from the drive, and how bad the drive's health is. ‘ddrescue’ also attempts to carefully read the disk in a way that should preserve as much data as possible even if the drive is mechanically failing. You can even go back and re-try sectors that didn't read properly the first time.
And so, I changed the boot order of the machine, then booted into a live install of Ubuntu 20 with an attached USB disk to write the image to. I installed ddrescue, and ran it using the following command:
sudo ddrescue -vv -d -r0 /dev/sdb hd-image-ddrescued.img progress.log
Fortunately, ddrescue completed the imaging process without error and exited normally. Immediately after ddrescue finished, I decided to compute an md5 checksum of the raw disk device itself to compare it with the checksum of the image that I had just obtained:
md5sum /dev/sdb
The resulting checksum value was the following:
3d3085c04c3b148f6abb08ceb4b3d6e0 /dev/sdb
I also made sure to use the ‘sync’ command and properly unmount the external USB storage (where my copied image was located):
sync
By using the ‘sync’ command, I can ensure that any cached writes of the copied disk image are committed to my external hard drive before I try to unmount and remove it.
On another, much faster computer, I did a checksum to verify the saved disk image, and to my surprise I got the following:
fc0e287bd2fc9f09c8f48a8ab675294f hd-image-ddrescued.img
That wasn't what I expected (I was expecting it to be 3d3085c04c3b148f6abb08ceb4b3d6e0). I started to wonder if I had done something wrong, or maybe I had some incorrect assumptions about running md5sum directly on the block device, or maybe I didn't understand what result ddrescue was supposed to produce?
I decided to reboot the computer and, again, boot into the Ubuntu live system. Throughout this process, I was careful to never mount or boot into the disk that I was imaging to avoid accidentally writing to it.
This time I decided to use the ‘dd’ command to clone the drive. If you don't specify a block size to ‘dd’, it will use a value of 512 bytes. If you try this for yourself, you'll probably discover that the copy progress of ‘dd’ is quite slow with a small block size like 512 bytes. Now, if you do a bit of googling, you'll find that you can explicitly set the block size to a larger number to get a faster copy speed. Therefore, I chose a block size of ‘64K’:
dd if=/dev/sdb of=image-dd-cloned.img bs=64K conv=noerror,sync
After I finished making the image with dd, I then ran another checksum on the block device to make sure that its md5sum had not changed:
md5sum /dev/sdb
3d3085c04c3b148f6abb08ceb4b3d6e0 /dev/sdb
This is the same value that I got the first time I calculated the checksum of this block device. This was reassuring since it meant that the raw data on the disk had not changed even after rebooting the machine.
Next, I ran an md5sum of the disk image that I just created with dd:
md5sum image-dd-cloned.img
a974938dcfb9eb25121d33cdf673330f image-dd-cloned.img
What?!? That's now 3 different hashes, and none of them match! Am I losing my mind?
Then, I started to look a bit more closely at the file/device sizes:
fdisk shows the following:
Disk /dev/sdb: 465.78 GiB, 500107862016 bytes, 976773168 sectors
Disk model: Hitachi HDP72505
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x07650765

Device     Boot  Start       End   Sectors   Size Id Type
/dev/sdb1  *      2048    206847    204800   100M  7 HPFS/NTFS/exFAT
/dev/sdb2       206848 976771071 976564224 465.7G  7 HPFS/NTFS/exFAT
and lsblk -b shows this:
NAME   MAJ:MIN RM         SIZE RO TYPE MOUNTPOINT
sdb      8:0    0 500107862016  0 disk
├─sdb1   8:1    0    104857600  0 part
└─sdb2   8:2    0 500000882688  0 part
So at least these two commands agree that the size of this drive is 500107862016 bytes. Now, how big are the disk images that I created?
ls -latr *
-rw-r--r-- 1 robert robert 500107862016 Jul 22 19:47 hd-image-ddrescued.img
-rw-r--r-- 1 robert robert 500107902976 Jul 23 20:09 image-dd-cloned.img
Okay, that explains why the hash of the image that was copied using ‘dd’ is different. The ‘dd’ed image size is larger than the size of the block device! There is no way the checksums would match (okay sure, it could if there was a hash collision, but this isn't an article on cryptographically secure hashes).
So why is the image size wrong when creating the disk image using ‘dd’, but not ‘ddrescue’? Well, here's the first interesting take-away from this article:
It has everything to do with the block size that I supplied to the ‘dd’ command: The block size of the drive itself is 512 bytes, and its capacity is 500107862016 bytes, so 500107862016 / 512 = exactly 976773168 blocks. However, when running ‘dd’, I didn't use the default block size of 512 because that's too slow. Instead, I specified a block size of ‘64K’ or 1024 * 64 = 65536 bytes. If you calculate 500107862016 / 65536 you get 7631040.375. Last time I checked, you can't perform 0.375 of a hard disk read, so it's reasonable to assume that this gets rounded up to 1.0 hard disk reads, or ‘7631041 block reads of size 65536’. But guess what? 7631041 * 65536 = 500107902976 which is exactly the size of the image that I got with the ‘dd’ command! I checked the contents of this ‘extra’ data, and it simply contains zeros. The (somewhat presumptuous) conclusion that I'll draw from this single experience is as follows: If the total size of the block device is not an integer multiple of the block size that you specify to ‘dd’, then you can expect to get zero-padding on the end of your image. In this case, your important data from the drive is not lost, but this could lead to confusing results if you ever try to copy the image back (and find that it doesn't fit), or if you actually verify your hash checksums like I do. (Note: After publishing this video/article, I received an email providing more insight on this issue, and it likely has to do with the flags that I supplied to dd: See 2022-12-29 Note).
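The arithmetic above can be checked with a few lines of Python. This is only a sketch of the rounding argument, not a model of what ‘dd’ does internally; the function name `padded_image_size` is made up for illustration. For what it's worth, GNU dd documents `conv=sync` as padding every short input block with NULs up to the block size, which is consistent with the zero-filled tail seen here.

```python
DEVICE_BYTES = 500107862016   # size reported by fdisk and lsblk -b
BLOCK_SIZE = 64 * 1024        # the bs=64K passed to dd


def padded_image_size(device_bytes, block_size):
    """Size of an image whose final partial block is padded to a full block.

    Hypothetical helper: it only reproduces the round-up arithmetic.
    """
    full_blocks, remainder = divmod(device_bytes, block_size)
    blocks = full_blocks + (1 if remainder else 0)
    return blocks * block_size


image_bytes = padded_image_size(DEVICE_BYTES, BLOCK_SIZE)
print(image_bytes)                 # 500107902976, the size ls reported
print(image_bytes - DEVICE_BYTES)  # 40960 bytes of padding
```

If the device size had been an exact multiple of the block size, the function would return the device size unchanged and no padding would be expected.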
So, now let's pretend the end of the image past byte 500107862016 doesn't exist and re-compute the hash:
$ head -c 500107862016 image-dd-cloned.img | md5sum
3d3085c04c3b148f6abb08ceb4b3d6e0 -
Success! That's exactly the hash value that I got from directly hashing the block device!
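For anyone who would rather do the same truncated hash from a script, here is a minimal Python sketch of the equivalent of `head -c N file | md5sum`, using only the standard `hashlib` module. The function name `md5_of_prefix` and the 1 MiB chunk size are my own choices for illustration.

```python
import hashlib


def md5_of_prefix(path, limit, chunk_size=1024 * 1024):
    """Return the md5 hex digest of the first `limit` bytes of `path`."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, "rb") as f:
        while remaining > 0:
            # Never read past the requested limit.
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break  # file is shorter than the limit
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


# Example call (the file name is the image from this article):
# md5_of_prefix("image-dd-cloned.img", 500107862016)
```

Reading in chunks keeps memory use flat, which matters when the file is 500GB and the machine's RAM is the thing under suspicion.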
Okay, we've made some progress getting the dd image checksum to match, now let's look into why the ddrescue one didn't produce the right hash value. It's probably another file size/block padding issue, right?
‘ddrescue’ seems to have produced an image that is exactly the right size, so extra padding at the end can't be the issue. Furthermore, the difference can't be due to an unexpected write/modification, because the image created with the dd command was taken later, and the md5 checksum was able to match then! Maybe I've misunderstood something fundamental about how ddrescue works and constructs images?
Let's check what the differences are with this ‘cmp’ command:
cmp -lb hd-image-ddrescued.img <(head -c 500107862016 image-dd-cloned.img)
I expected to see some huge section changed in the ddrescue image, but instead I got the following:
52185498315 107 G 307 M-G
73775329995 177 ^? 377 M-^?
86801068747 147 g 347 M-g
181705323211 107 G 307 M-G
186007311051 66 6 266 M-6
207345296075 55 - 255 M--
235067732683 76 > 276 M->
290357666507 37 ^_ 237 M-^_
300967236299 106 F 306 M-F
305192092363 147 g 347 M-g
313683256011 104 D 304 M-D
322162291403 167 w 367 M-w
355933934283 56 . 256 M-.
383209988811 124 T 324 M-T
It's worth explaining the above output a bit to make sure it's clear, so if we focus on this line:
52185498315 107 G 307 M-G
The above line means that at 1-based byte offset 52185498315 in the file ‘hd-image-ddrescued.img’, the ‘cmp’ command saw a byte that can be represented as octal 107 (hex value 0x47, text value G), but in the other file (or pipe in this case) ‘head -c 500107862016 image-dd-cloned.img’, it saw a byte with octal value 307 (hex value 0xC7, represented as M-G).
To be even more clear, here is a simple example of using the cmp command to compare the bytes ‘a’ and ‘b’:
cmp -lb <(echo -n 'a') <(echo -n 'b')
and the corresponding output from the above command:
1 141 a 142 b
Now that you know how to read the output of the commands above, do you notice something in common with the differences in these bytes? Let's put the output from above in a file called ‘differences.txt’ and try to format the output a bit differently with this python script:
import re
import sys

# Print the column headers, each padded to 20 characters.
for header in ['Byte Offset', 'Octal Byte #1', 'Byte #1', 'Octal Byte #2', 'Byte #2', 'Binary Byte #1', 'Binary Byte #2', 'Byte #1 XOR Byte #2']:
    sys.stdout.write("{:<20}".format(header))
sys.stdout.write("\n")

# Each input line is one line of 'cmp -lb' output:
# offset, octal byte #1, char #1, octal byte #2, char #2
for line in sys.stdin:
    parts = re.split(r"\s+", line.strip())
    offset_number = int(parts[0])
    octal_a = int(parts[1], 8)
    octal_b = int(parts[3], 8)
    parts.append('{0:08b}'.format(octal_a))
    parts.append('{0:08b}'.format(octal_b))
    parts.append('{0:08b}'.format(octal_a ^ octal_b))
    for part in parts:
        sys.stdout.write("{:<20}".format(part))
    sys.stdout.write("\n")
After putting the above script into the file ‘parse_differences.py’, and running it like this:
cat differences.txt | python3 parse_differences.py
The output is as follows:
Byte Offset   Octal Byte #1  Byte #1  Octal Byte #2  Byte #2  Binary Byte #1  Binary Byte #2  Byte #1 XOR Byte #2
52185498315   107            G        307            M-G      01000111        11000111        10000000
73775329995   177            ^?       377            M-^?     01111111        11111111        10000000
86801068747   147            g        347            M-g      01100111        11100111        10000000
181705323211  107            G        307            M-G      01000111        11000111        10000000
186007311051  66             6        266            M-6      00110110        10110110        10000000
207345296075  55             -        255            M--      00101101        10101101        10000000
235067732683  76             >        276            M->      00111110        10111110        10000000
290357666507  37             ^_       237            M-^_     00011111        10011111        10000000
300967236299  106            F        306            M-F      01000110        11000110        10000000
305192092363  147            g        347            M-g      01100111        11100111        10000000
313683256011  104            D        304            M-D      01000100        11000100        10000000
322162291403  167            w        367            M-w      01110111        11110111        10000000
355933934283  56             .        256            M-.      00101110        10101110        10000000
383209988811  124            T        324            M-T      01010100        11010100        10000000
Pay special attention to the column ‘Byte #1 XOR Byte #2’. This column shows the bitwise exclusive OR between each byte that had a difference in the ‘ddrescue’ image compared against the image that was created with ‘dd’ (at least the first 500107862016 bytes of this image).
The exclusive OR shows us a ‘1’ wherever the bits have ‘flipped’. This makes it incredibly obvious that the only difference in the ‘ddrescue’ image is these 14 individual bytes in the 500GB image file where the exact same high bit of the byte has flipped: Byte #1 (from the ‘ddrescue’ image) has a high bit of 0 where Byte #2 (from the ‘dd’ image, which matches the block device) has a high bit of 1. This is why the md5 checksum of the ddrescue image doesn't match the md5sum of the raw block device, or the one created using ‘dd’!
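As a small cross-check of the XOR column, the following Python sketch lists which bit positions differ between two bytes (bit 0 being the least significant). The helper name `flipped_bits` is made up for this example; the byte pairs are the first three rows of the cmp output.

```python
def flipped_bits(byte_a, byte_b):
    """Return the bit positions (0 = least significant) where two bytes differ."""
    diff = byte_a ^ byte_b
    return [bit for bit in range(8) if diff & (1 << bit)]


# (ddrescue byte, dd byte) pairs taken from the first three cmp rows, in octal.
pairs = [(0o107, 0o307), (0o177, 0o377), (0o147, 0o347)]
for ddrescue_byte, dd_byte in pairs:
    print("{:08b} {:08b} -> {}".format(ddrescue_byte, dd_byte,
                                       flipped_bits(ddrescue_byte, dd_byte)))
```

Every pair reports only bit 7, the high bit of the byte.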
At this point, it's reasonable to suspect memory errors. Let's reboot into memtest86 and test out the memory:
Yup, that's a memory error!
But we're not done yet. We can go even deeper.
The failing address in the image is 0x000784512c8, or 2017792712 in decimal. The ‘Err-Bits’ value was reported by memtest as ‘0x00800000’. This value of ‘0x00800000’ represents a mask that shows exactly which bits of memory at address 0x000784512c8 contributed to the error. Since this machine is little-endian, the address 0x000784512c8 points to the byte on the right-most end of the mask ‘0x00800000’. If we split up this machine word into bytes, we get Byte #3: 0x00, Byte #2: 0x80, Byte #1: 0x00, Byte #0: 0x00. Counting upward, one byte at a time, the exact byte where the single-bit error is located is Byte #2: 0x80.
Therefore, the exact address of the individual byte where the error occurred was memory address 0x000784512c8 + 2, which is 0x000784512ca, or 2017792714 in decimal.
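The byte-counting step can be written out as a short Python sketch. It assumes, as the reasoning above does, a 4-byte little-endian word where the reported address is the address of Byte #0; the function name `failing_byte_addresses` is hypothetical and not anything memtest provides.

```python
def failing_byte_addresses(word_address, err_bits, word_bytes=4):
    """Map a memtest-style error mask onto individual byte addresses.

    Assumes a little-endian word: byte i of the mask lives at word_address + i.
    Returns a list of (byte_address, byte_mask) for bytes with any error bit set.
    """
    results = []
    for i in range(word_bytes):
        byte_mask = (err_bits >> (8 * i)) & 0xFF
        if byte_mask:
            results.append((word_address + i, byte_mask))
    return results


for address, mask in failing_byte_addresses(0x784512C8, 0x00800000):
    print(hex(address), address, hex(mask))  # 0x784512ca 2017792714 0x80
```

Note that the byte mask comes out as 0x80, which is the high bit of the byte.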
Now, if we think back to our list of ‘Byte Offsets’ from the output of the ‘cmp’ command, we can look for even more patterns even though these might seem like random numbers at first:
Byte Offset
52185498315
73775329995
86801068747
181705323211
186007311051
207345296075
235067732683
290357666507
300967236299
305192092363
313683256011
322162291403
355933934283
383209988811
Remember that the above numbers are byte offsets that tell us how many bytes into the hard drive image the ‘cmp’ command found a difference. These are not addresses in RAM, so they won't have anything to do with the address of our failing bit in memory... Or do they?
The word on the street is that information is read into the computer's memory in chunks called ‘pages’. There is also talk of how these so-called ‘pages’ are often 4K or 4096 bytes in size. Let's modify our previous python script to show us the ‘byte offset into the file modulo 4096’ (with a -1 adjustment to account for 1-based indexing):
import re
import sys

# Print the column headers, each padded to 20 characters.
for header in ['(Offset -1) % 4096', 'Byte Offset', 'Octal Byte #1', 'Byte #1', 'Octal Byte #2', 'Byte #2', 'Binary Byte #1', 'Binary Byte #2', 'Byte #1 XOR Byte #2']:
    sys.stdout.write("{:<20}".format(header))
sys.stdout.write("\n")

for line in sys.stdin:
    parts = re.split(r"\s+", line.strip())
    offset_number = int(parts[0])
    octal_a = int(parts[1], 8)
    octal_b = int(parts[3], 8)
    # cmp offsets are 1-based, so subtract 1 before taking the page offset.
    parts.insert(0, str((offset_number - 1) % (4 * 1024)))
    parts.append('{0:08b}'.format(octal_a))
    parts.append('{0:08b}'.format(octal_b))
    parts.append('{0:08b}'.format(octal_a ^ octal_b))
    for part in parts:
        sys.stdout.write("{:<20}".format(part))
    sys.stdout.write("\n")
And the output is now this (pay attention to the first column):
(Offset -1) % 4096  Byte Offset   Octal Byte #1  Byte #1  Octal Byte #2  Byte #2  Binary Byte #1  Binary Byte #2  Byte #1 XOR Byte #2
714                 52185498315   107            G        307            M-G      01000111        11000111        10000000
714                 73775329995   177            ^?       377            M-^?     01111111        11111111        10000000
714                 86801068747   147            g        347            M-g      01100111        11100111        10000000
714                 181705323211  107            G        307            M-G      01000111        11000111        10000000
714                 186007311051  66             6        266            M-6      00110110        10110110        10000000
714                 207345296075  55             -        255            M--      00101101        10101101        10000000
714                 235067732683  76             >        276            M->      00111110        10111110        10000000
714                 290357666507  37             ^_       237            M-^_     00011111        10011111        10000000
714                 300967236299  106            F        306            M-F      01000110        11000110        10000000
714                 305192092363  147            g        347            M-g      01100111        11100111        10000000
714                 313683256011  104            D        304            M-D      01000100        11000100        10000000
714                 322162291403  167            w        367            M-w      01110111        11110111        10000000
714                 355933934283  56             .        256            M-.      00101110        10101110        10000000
714                 383209988811  124            T        324            M-T      01010100        11010100        10000000
And just like that, these ‘seemingly random’ image file byte offsets contain an incredibly obvious pattern: They're all congruent to 714 modulo 4096. And, remember that exact address of the failing memory byte, 0x000784512ca? Go ahead and calculate what it's congruent to modulo 4096. I dare you. It's 0x000784512ca % 4096 = 714 in decimal.
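If you'd like to take the dare without a calculator, this short Python sketch reduces all 14 cmp offsets and the failing byte address modulo the page size. The constant and function names are mine; the numbers are copied from the cmp output and the memtest result.

```python
PAGE_SIZE = 4096

# 1-based byte offsets reported by cmp for the 14 differing bytes.
CMP_OFFSETS = [
    52185498315, 73775329995, 86801068747, 181705323211, 186007311051,
    207345296075, 235067732683, 290357666507, 300967236299, 305192092363,
    313683256011, 322162291403, 355933934283, 383209988811,
]

# Byte address derived from the memtest report (0x000784512c8 + 2).
FAILING_BYTE_ADDRESS = 0x784512CA


def page_offset_from_cmp(offset):
    """Convert a 1-based cmp offset to a 0-based offset within a 4096-byte page."""
    return (offset - 1) % PAGE_SIZE


file_page_offsets = {page_offset_from_cmp(offset) for offset in CMP_OFFSETS}
print(file_page_offsets)                  # {714}
print(FAILING_BYTE_ADDRESS % PAGE_SIZE)   # 714
```

Collecting the results into a set makes the point compactly: 14 offsets spread across hundreds of gigabytes collapse to a single value.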
If that's not enough to convince you, try piping the output above through the following awk command:
cat differences.txt | python3 parse_differences.py | awk -F '' '{for(i=1; i<=NF; i++){if((i>(60-(NR-14)*1))&&(i<(86+(NR-14)*1))&&NR<15&&(((i-73)^2/3^2)+((NR-9)^2/1^2)>3)) printf "1"; else printf " "} printf "\n"}'
Take one look at the output and the truth becomes obvious:
            1
           111
          11111
         1111111
        111111111
       11111111111
      11         11
     11           11
    1111         1111
   1111111111111111111
  111111111111111111111
 11111111111111111111111
1111111111111111111111111
Doesn't this remind you of something?
At this point, we've confirmed that there are in fact ‘bit flips’ occurring in memory, and furthermore, we can make a precise association between the individual bit flips that were reported by memtest in hardware, and the bit flips that we observed in the ‘ddrescue’ software as we imaged the hard drive. This association comes from the fact that all of the failing addresses were congruent to 714 modulo 4096. The value ‘714’ (modulo 4096) in relation to the hardware itself is not significant, but the fact that we also observed the exact same value, 714 (modulo 4096), through the software activity of ‘ddrescue’ seems like more than a coincidence. It's reasonable to conclude that the ‘ddrescue’ program requested heap memory pages from the operating system that were 4096 bytes in size as temporary storage for data that was copied during the imaging process of the hard drive. By chance, some of these memory pages just happened to include the bad physical memory address noted above (0x000784512ca), causing the bits to be flipped and corrupting the resulting image.
So, if we got 14 bit flips when creating the image with ‘ddrescue’, how come we got 0 bit flips when copying with the ‘dd’ command? That's a good question, and I don't have a great answer, but I would speculate that it has something to do with how ‘ddrescue’ uses heap memory internally compared with how ‘dd’ does. The ‘dd’ command is a much simpler tool that does block copies of data without trying to maintain data structures to re-construct and retry failed sectors in the way that ‘ddrescue’ does. It's likely down to the fact that the ‘dd’ command simply makes less use of heap memory, or accesses it in a different pattern, compared to the ‘ddrescue’ command.
Whatever the reason is for the difference in behaviour between ‘dd’ and ‘ddrescue’, we could argue that you might want to use one or the other tool to image a drive depending on the circumstances. In cases where you suspect a hard drive to be failing, you would prefer to image the drive using ‘ddrescue’. However, in cases where you suspect that the memory is failing, you would prefer to image using the ‘dd’ command. If you suspect that either could be failing, you'd image the drive using both methods and then verify by computing the hash of all three: The raw block device, the dd-based image and the ddrescue-based image.
Okay, so at this point, I was sure that there are definitely memory errors on this computer. However, I already have my fully intact and verifiably correct hard drive image despite the failing memory, so we're done, right? Nope! It's time to replace the memory and then verify that the hardware issue has been fixed!
One of the first changes that I made to the hardware was to replace the power supply with a brand new one. Power supplies contain components like capacitors that can degrade over time.
I didn't actually do any kind of fancy analysis to evaluate whether the power supply really did have any problems (I don't even have the kind of equipment you'd need to do such a test), but I felt it was safe to assume that after almost 15 years of operation, a replacement couldn't hurt.
Replacing the power supply might seem like an odd place to start debugging a problem with the computer's memory, but if there really was a problem with the power supply then you could end up with all kinds of extremely strange and hard to debug problems with the rest of your hardware/software. When the power supply functions correctly, it should convert 120V or 240V AC into well filtered 3.3V, 5V or 12V DC. However, if the power supply starts to fail, these DC voltages could be too high or low, or experience ripple.
Many experienced computer users have heard of the idea that bit flips are extremely rare occurrences that are almost always caused by cosmic rays from space. In my experience, bit flips are a much more frequent occurrence that can typically be attributed to a persistent hardware issue (like a failing power supply).
With the power supply replaced, my weeks-long saga of troubleshooting the memory began:
First, I started with a test to confirm that I still had memory failures even after replacing the power supply. Indeed, memtest was still identifying failures after just a few minutes. For this test, I still had all four memory slots populated with a 1GB stick each. I should also mention that this computer had several failed case fans, so it was running hotter than normal.
I decided to try to use a process of elimination to identify whether there was a single stick, or possibly multiple sticks that were the root cause of the problem. I ran another test with only a single stick of RAM in the slot closest to the CPU. After 40 min, it had 0 errors, so I concluded that this stick might be OK.
Then, to be extra thorough, I put all four sticks back in and ran memtest again. A common cause of memory errors is RAM that isn't making good contact with the slots on the motherboard. Sometimes, simply taking the RAM out and putting it back in will fix the issue. This wasn't the case here, because more errors appeared within a few minutes.
Based on the addresses where the memory errors occurred, I started to suspect that there might actually be two sticks of RAM that were bad. I ran another test where I kept only two of these sticks in the motherboard. This time, the test ran for 1 hour 22 minutes without reporting any errors.
At this point, I was starting to suspect that maybe it wasn't even the RAM that was bad, but perhaps the RAM slots (or the motherboard, or the CPU). I ran another test where I put the two remaining sticks of RAM into the exact same slots that I had just seen working without errors in the previous test.
This time I got errors again. So now, I had the errors down to two sticks of RAM in the two slots that were closest to the CPU. I first tested again with only the first stick of RAM in the slot closest to the CPU. This time, it ran for 1 hour and 46 minutes with no errors. From this, I concluded that the culprit probably wasn't this stick or the slot closest to the CPU.
Next, I tested the second of the two RAM sticks in the second slot away from the CPU. I did get an error just before the 1 hour mark. So, now I had localized the error down to one stick of RAM in at least one of the slots.
In order to try to rule out a problem with the RAM slot itself, I did another test with this exact same stick in the slot closest to the CPU, and got another failure at almost the 3 hour mark.
I found it a bit odd that it took a lot longer to trigger errors in this slot, so I did another identical test of this RAM stick in this same slot while the computer was still warm. Again, it took about 3 hours for errors to pop up.
In order to completely rule out any involvement of the second RAM slot itself, I decided to do even more testing in this slot with one of the other sticks of RAM that I hadn't yet observed failures on. This test ran for over 10 hours, and did not result in any errors. At this point, I concluded that I had identified at least one bad stick of RAM, since it failed with the same error mask in multiple slots. I put a mark on this stick to identify it as stick #1. Furthermore, I concluded that the motherboard itself (or the CPU) was not likely to be the root cause of the failures. I also concluded that the stick that ran for over 10 hours without error was probably okay. I then marked this stick to identify it as stick #2.
Next, I took one of the remaining untested sticks and did the same test on it in the second slot away from the CPU. This stick ran for over 11 hours without any errors. I identified this stick as #3.
Finally, I tested the last stick, #4, for 12:47, and did not see any errors. So far, I've identified #1 as bad, and #2, #3, and #4 as possibly good.
So, the next thing to do is put these three so-called ‘good’ sticks together, and test them out because hopefully they should work, right? Nope! One failure after only an hour:
I wasn't completely sure about how the RAM addresses mapped to the actual RAM slots, and I had a hunch there might be an issue with the stick in the third slot away from the CPU. I decided to take out all but this stick in the third slot, and do another test. I also hadn't tested much in this slot before, so I figured it was worth a shot. I hoped to get lucky and catch a failure on this stick, but after an hour there were still no errors on this stick.
At this point, it was clear that testing individual sticks one at a time wouldn't quickly reveal more failures, so I decided to move to an approach of testing more than one stick at once. I would also start being more careful in tracking the stick and slot numbers, and logging the exact failure addresses.
So, for the first test, I put the sticks in the order #3, #2, #4 and got my first failure just before the 2 hour mark. There was only one error and its mask was 0x100. The failing address was 0x000b08f3204.
For the second test, the sticks had the order #2, #3, #4, and it had failures by the 32 min mark. There were 7 failures with mask 0x200 at address 0x000b4218f9c, and 1 failure with mask 0x4 at address 0x000b2cdf93c.
For the third test, the sticks had the order #2, #4, #3, and I noted that the first failure occurred at the 1hr 49min mark. This test had logged 2 errors with mask 0x4 at address 0x000659bf27c.
For the fourth test, the sticks had the order #4, #2, #3, and I noted the first failure at the 36 min mark. This test had logged 1 error with mask 0x100 at address 0x000b08f3204. As I started to review the data from all four tests, I realized that this was probably the worst possible result:
From the above test results, you can see that test #1 and test #4 both have a failure at the exact same address with the exact same bit mask. If you look at the RAM slot orderings for these two tests, you can see that the only commonality between them is the position of the #2 stick. If you assume that stick #2 (or its slot) is the only bad one among these three, then you would expect to also see a clear pattern between the failures in test #2 and test #3. The #2 stick is in the same location in test #2 and test #3, but clearly the failures in these two tests are at locations that are in different megabytes.
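The comparison between the four tests can be laid out as data. The Python sketch below is just a bookkeeping aid under one assumption that I never verified: that the listed stick order corresponds to slot positions counting away from the CPU. The names `TESTS`, `sticks_in_same_slot` and `compare` are invented for this example.

```python
# Stick order per test and the (address, mask) pairs that memtest logged.
TESTS = {
    1: {"order": (3, 2, 4), "failures": {(0xB08F3204, 0x100)}},
    2: {"order": (2, 3, 4), "failures": {(0xB4218F9C, 0x200), (0xB2CDF93C, 0x4)}},
    3: {"order": (2, 4, 3), "failures": {(0x659BF27C, 0x4)}},
    4: {"order": (4, 2, 3), "failures": {(0xB08F3204, 0x100)}},
}


def sticks_in_same_slot(order_a, order_b):
    """Sticks that sit in the same position in both orderings."""
    return [a for a, b in zip(order_a, order_b) if a == b]


def compare(test_a, test_b):
    """Return (sticks in the same slot, failures common to both tests)."""
    a, b = TESTS[test_a], TESTS[test_b]
    return (sticks_in_same_slot(a["order"], b["order"]),
            a["failures"] & b["failures"])


print(compare(1, 4))  # stick #2 unmoved, identical failure in both tests
print(compare(2, 3))  # stick #2 unmoved, but no failure in common

# Which megabyte each failing address falls in (address >> 20).
for number, test in TESTS.items():
    print(number, sorted(address >> 20 for address, _ in test["failures"]))
```

Tests #1 and #4 share a failure with stick #2 unmoved, while tests #2 and #3 also keep stick #2 in place yet share nothing, which is the contradiction described above.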
From this, I concluded the next: The remaining {hardware} error in these final 3 sticks of RAM can’t be resulting from a single unhealthy stick of RAM. It should both be resulting from a number of unhealthy sticks of RAM, or some form of much less apparent common-mode drawback. There could also be a problem motherboard’s skill to refresh the RAM states when greater than a pair RAM sticks are current. Maybe there’s a leaking capacitor someplace on the motherboard, or possibly the presence of a number of sticks of RAM causes an intolerably lengthy improve within the time between refreshes.
I also have not given much consideration to the selection of RAM timings, voltages, and frequencies. So far, I’ve just relied on the BIOS to automatically select the right default values, and it’s beyond my capabilities to select better values.
At this point, I felt that I had done almost everything I could do to isolate which sticks of RAM were bad, and that it was probably worth giving up on all 4 of the original sticks of RAM.
For this next bit of hardware improvement, I decided to replace all the thermal paste, install a new CPU fan, and replace all the case fans:
Now, with all the fans replaced, I decided to start up memtest again with these three sticks and see if the cooler temperature had much influence on the rate at which errors appear.
I let this test run for almost 18 hours, and it only recorded 1 error during that time. This contrasts with the exact same test that I did before, where I would usually get multiple errors within the first couple of hours. From this, I conclude that letting your computer get very hot is one very likely cause of increased memory errors.
Back when I had seen the first memtest error, I decided to order some new RAM from China, and since that finally arrived, I decided to try it out too. It cost me $30 including shipping to get these 4 brand new sticks direct from China:
I started out by putting all 4 sticks of RAM in, and tried to boot it up. Unfortunately, the motherboard didn't like this RAM and I only got POST beep codes for memory issues. Then, I removed 3 of the sticks so that only one remained, and tried again. It still just gave me beep codes for memory issues. Around this time, I started doing more research and discovered that there is another important variable to consider when buying RAM, and that is the ‘RAM density’. Apparently, older motherboards cannot handle newer high density RAM. This is a problem because most of the cheap RAM that you find online does not give any indication of what its density is, so there is no way to know if it will work until you buy it and put it in the motherboard.
I decided to give it another shot, so I bought some more cheap RAM off of eBay:
This here is supposed to be 2 sticks of 2GB each. From the label on the RAM it looks like they might actually be 4GB sticks, but the description assured me that they were indeed 2GB sticks, which are supposed to work on this motherboard. For the record, the motherboard is an Intel DP45SG.
I installed this RAM and found that it also did not work. I tried a few times with only one stick and I also tried re-arranging the RAM slots a couple of times, but this only produced more POST error codes.
Naturally, the next step was to buy even more RAM off of eBay:
I believe these ones were actually used sticks. They’re 1GB each, and they appear to be a mixed lot. For this set it looks like there is a high probability that the RAM timings, voltage and density could all be different and possibly incompatible. I decided to start by not paying any attention to this detail, and just put them in all at once and see what happens.
Clearly, it did not work with all 3 sticks of RAM together, so I tested each stick individually, and to my surprise, all three of these sticks work!
After looking a bit more carefully at the voltages and timings of these sticks of RAM, there are two of them that appear to be fairly similar, so I decided to try installing both of them together.
And fortunately, it did manage to boot up successfully and I was able to get into memtest:
I decided to let memtest run with these sticks for 50 hours, and after 69 passes with 0 failures, I think it is time to declare this a success. This 2GB of memory is a lot less than the 8GB that I wanted to install in this machine, but given how hard it is to find old RAM that actually works, I am going to have to declare this as ‘good enough’.
So what kind of conclusions can we draw from the experience described above? There are a number of mini-lessons and learnings, but the most significant idea that I want to impart to the reader is the scale of how much time I wasted on debugging this. With cheap consumer non-ECC memory, your bits can flip here and there, and you will have no idea! The first indication that I had that there was even a problem was a silently corrupted hard drive image. To even detect this, I needed the patience and discipline to verify the checksum on a 500GB file! Imagine how much more time I would have wasted if I hadn't bothered to verify the checksum and had made use of an important business document that contained one of the 14 bit flips.
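For anyone who wants to do the same kind of verification from a script, here is a minimal Python sketch of hashing a very large file in fixed-size chunks, so that the whole image never has to fit in memory. The function names and the 1 MiB chunk size are assumptions for illustration, and this is not the exact procedure I followed. Keep in mind that a checksum computed on a machine with failing RAM can itself be wrong, so a mismatch only tells you that something flipped somewhere.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it chunk_size bytes at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            digest.update(chunk)
    return digest.hexdigest()


def images_match(path_a, path_b):
    """Compare two files by checksum; True only if both hashes are equal."""
    return file_md5(path_a) == file_md5(path_b)
```

Running `file_md5` two or three times on the same image and comparing the results is a cheap sanity check: if the hashes disagree between runs on an unchanged file, the machine doing the hashing deserves suspicion before the file does.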
Now some of you might think that it is a bit melodramatic to even consider it a big deal when only 14 bits per 500GB of data are corrupted, but this is really what defines the difference between a user that needs ECC memory and a casual user who doesn't. Some people need computers to produce exactly correct results all the time, and some people are okay with just rebooting the machine when it misbehaves and then blaming the problem on ‘ghosts’.
Having said this, the performance and cost advantage of using consumer-grade non-ECC memory compared to ECC memory is getting smaller and smaller every year. Hopefully, one of these days, memory and CPU manufacturers will finally bite the bullet and declare “We’re permanently discontinuing non-ECC memory and CPU production forever. Moving forward, all consumer grade memory and CPUs will use error correction.” Once this happens, we can finally stop having conversations about this topic, and people like me won't need to write articles like this one.
Okay, now time to go upstairs and see what everyone else in the family is up to. Oops, everybody went to bed already.
After publishing the video above, I received an email from someone who suggested an explanation for the padding issue that I encountered above. The email reads as follows:
Hi, I just saw your video on the non-ECC RAM corrupted hard drive image. I don't like commenting on social media, so I thought I'd send you an email instead.
The reason your dd command creates an image size that doesn’t match the size of your block device is because you’re running “dd conv=sync”. The sync conversion option explicitly tells dd to pad every input block with NULs up to the input block size, which is the behaviour you're seeing. I assume you're trying to tell dd to use synchronous I/O, but that's a flag, not a conversion option.
While this is documented in the dd man page, it is really confusing.
If you really want synchronous I/O, try:
dd if=/dev/whatever of=image bs=64k conv=noerror iflag=sync oflag=sync
Although I’ve never used synchronous I/O in dd, I just run “sync” afterwards as you do since I like buffering and caching during the transfer 🙂
No affiliation with dd, gnu, youtube or anything else. I just saw your video and wanted to clear it up, since I recognized the problem.
Have one, Andreas
–End Of Email–
I must admit that I have never taken the time to gain a detailed understanding of all the flags that ‘dd’ supports, and I have not personally taken the time to verify the suggestions in the email described above. However, I will keep this note here as it may help others (and I will likely try this out myself some time in the future).
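As a rough illustration of the padding behaviour the email describes, here is a toy Python model of what happens to the image size when a short input block is padded with NUL bytes up to the block size. This is not dd itself and I have not checked it against dd. The function names are hypothetical, and the model assumes that only the final read comes up short (read errors or other short reads in the middle of the device would add more padding).

```python
def pad_block(data, block_size):
    """Pad one input block with NUL bytes up to block_size."""
    return data + b"\x00" * (block_size - len(data))


def padded_image_size(source_bytes, block_size):
    """Output size when the last, possibly short, block is padded to full size."""
    full_blocks, remainder = divmod(source_bytes, block_size)
    blocks = full_blocks + (1 if remainder else 0)
    return blocks * block_size


if __name__ == "__main__":
    # Hypothetical sizes: a 1000-byte source copied with a 64-byte block size.
    print(padded_image_size(1000, 64))  # 16 blocks of 64 bytes
    print(pad_block(b"abc", 8))
```

Under this model, a source whose size is an exact multiple of the block size comes out unchanged, while any other source grows to the next multiple, which would explain an image that is slightly larger than the device it was copied from.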