Linux Data Integrity Testing
Just a list of tools and notes for tracking down data corruption.
Testing suites with integrity testing
http://www.inquisitor.ru/about/
http://www.stresslinux.org/sl/
Memory Testing
For suspected faulty memory, use Memtest86+.
Corruption
Random CRC errors? I spent a week tracking down the fault in this PC. Every test passed: stress testing with stresslinux, a full destructive run of Inquisitor, Prime95, Memtest86+, everything. The only way I could reliably show a problem was doing this:
for i in $(seq 1 10); do wget -qO- http://xxx.xxx.xx.xx/cyclone/img/dq35.gz | pigz -d - | ntfsclone -rO /dev/sda1 - ; done
Setup: Gigabit link (lighttpd on the server end), image ~3.6 GB, PC is an Intel DQ35JOE with 3 GB RAM (2x 512 MB and 2x 1 GB modules).
Error: either a fault from ntfsclone ("ERROR: Invalid command code in image" or "ERROR:restore_image: corrupt image"), which shows up somewhere between every other run and every 8th run, or, if it gets all the way through, a CRC error from pigz (it checks the CRC of the gzip image) at the end of the entire download. I assume a fair amount of corruption makes ntfsclone see a problem mid-stream, whereas a single bit error only trips the pigz CRC check at the end.
Notes:
- Using wget piped to pigz for integrity testing alone (pigz -t) didn't fail.
- Copying the image to another partition on the HDD, then cat'ing it through pigz into ntfsclone, didn't show a problem either.
- It only manifests when all three are chained together, which presumably loads the system more.
- Another Intel DQ35JOE with exactly the same spec ran exactly the same test without any problem!
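To compare how often each variant fails (the full chain versus the partial chains in the notes above), something like the following can help. It is only a minimal sketch: tally_runs is a made-up helper name, and wrapping the pipeline in bash -o pipefail is my assumption, used so that a failure in any stage (for example a pigz CRC error) counts as a failed run rather than only the last command's exit status.

```shell
# Run a command N times, report each failing run on stderr, print a tally on stdout.
# Usage: tally_runs N command [args...]
# tally_runs is a hypothetical helper name, not part of any tool mentioned above.
tally_runs() {
    runs=$1
    shift
    fails=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Capture the exit status without aborting if the caller uses `set -e`.
        "$@" && status=0 || status=$?
        if [ "$status" -ne 0 ]; then
            fails=$((fails + 1))
            echo "run $i: FAIL (exit status $status)" >&2
        fi
        i=$((i + 1))
    done
    echo "$fails/$runs runs failed"
    # Return success only if every run passed.
    [ "$fails" -eq 0 ]
}

# Example (DESTRUCTIVE: overwrites /dev/sda1, same pipeline as the loop above):
# tally_runs 10 bash -o pipefail -c \
#   'wget -qO- http://xxx.xxx.xx.xx/cyclone/img/dq35.gz | pigz -d - | ntfsclone -rO /dev/sda1 -'
```

Unlike the plain for loop, this keeps going after a failure and prints a tally at the end, which makes an intermittent fault easier to put a number on.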
Tried:
- Fine running 2x 512 MB
- Fine running either 1 GB module on its own
- Failed with all modules in
- Failed with 2x 1 GB
Also tried:
- Same problem after swapping the 2x 1 GB modules for another 2x 1 GB pair (all Kingston)
- Same problem after changing the HDD and PSU
- Same problem after swapping channels around (1 GB modules moved from both blue slots to both black slots)
Therefore:
- OK: dual channel, single sided
- OK: single channel, double sided
- Fail: dual channel, double sided
What's the difference? Well, the 1 GB modules are double sided, so one side uses a different connection to the memory controller. Dual channel needs tight matching of the way each DIMM runs (sorry, I don't know the technicalities of this).
I guess this obscure fault is therefore triggered by running double-sided memory in dual channel mode, and the fault itself is in the memory controller. So I'm getting the board RMA'd now.
Incidentally, I had a similar issue before (it was a RAM problem that time), and looping through extraction/integrity testing was the only way to bring it up.
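A non-destructive version of that kind of loop might look like this. It is a rough sketch, not the exact commands I ran: verify_loop and the GZ variable are names made up here, gzip is the default only so it runs anywhere (set GZ=pigz to match the setup above), and sha256sum from coreutils is assumed to be installed. Each pass checks the stored CRC and then compares a checksum of the decompressed stream against the first pass, so any run that decompresses differently shows up.

```shell
# Decompress the same gzip file repeatedly and compare checksums of the output.
# On healthy hardware every pass should produce an identical checksum.
# Usage: verify_loop FILE.gz [PASSES]
# GZ is an assumed override: GZ=pigz verify_loop image.gz 30
verify_loop() {
    file=$1
    passes=${2:-10}
    gz=${GZ:-gzip}
    ref=""
    n=1
    while [ "$n" -le "$passes" ]; do
        # -t verifies the CRC stored in the gzip trailer.
        if ! "$gz" -t "$file"; then
            echo "pass $n: CRC check failed"
            return 1
        fi
        # Hash the decompressed stream; the first pass becomes the reference.
        sum=$("$gz" -dc "$file" | sha256sum | cut -d' ' -f1)
        if [ -z "$ref" ]; then
            ref=$sum
        elif [ "$sum" != "$ref" ]; then
            echo "pass $n: checksum mismatch"
            return 1
        fi
        n=$((n + 1))
    done
    echo "$passes passes OK"
}
```

Since this never writes to a block device, it is safe to leave running overnight, though as noted above it may load the system less than the full download, decompress and restore chain.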
Conclusion: the board was replaced with an identical one, which just passed 30 loops with no problems :)