Linux Data Integrity Testing

From richud.com


Just a list of some things.

Testing suites with integrity testing

http://www.iozone.org/

http://www.inquisitor.ru/about/

http://www.stresslinux.org/sl/


Memory Testing

To test for faulty memory, use Memtest86+.


Corruption

Random CRC errors? I spent a week working out the fault in this PC. All tests and stress testing (stresslinux, a full destructive run on Inquisitor, Prime95, Memtest86+) passed. The only way I could reliably show a problem was doing this:

for i in $(seq 1 10); do wget -qO- http://xxx.xxx.xx.xx/cyclone/img/dq35.gz | pigz -d - | ntfsclone -rO /dev/sda1 - ; done

Setup: Gigabit link (lighttpd on the server end), image ~3.6 GB, PC Intel DQ35JOE with 3 GB RAM (2x 512 MB and 2x 1 GB modules)

Error: either a fault reported by ntfsclone, "ERROR: Invalid command code in image" or "ERROR:restore_image: corrupt image" (showing up anywhere between every other run and every 8th run), or, if it gets all the way through, a CRC error from pigz (which checks the CRC of the gzip image) at the end of the entire download. I assume a fair amount of corruption triggers ntfsclone to see a problem mid-flow, so to speak, whereas a single bit error may only trip the pigz CRC check at the end.
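To see which stage of the pipe actually reported the failure on a given run, the per-stage exit codes can be printed. This is only a rough sketch: report_stages is a made-up helper name, not part of wget, pigz or ntfsclone, and the labels passed to it are just text.

```shell
#!/usr/bin/env bash
# Sketch only: report_stages is a hypothetical helper, not part of any tool.
# Call it on the line directly after a pipeline, passing "${PIPESTATUS[*]}"
# followed by one label per pipeline stage, for example:
#
#   wget -qO- http://xxx.xxx.xx.xx/cyclone/img/dq35.gz | pigz -d - | ntfsclone -rO /dev/sda1 -
#   report_stages "${PIPESTATUS[*]}" wget pigz ntfsclone
#
report_stages() {
    local -a codes names
    read -r -a codes <<< "$1"   # split "0 1 141" into an array
    shift
    names=("$@")
    local i failed=0
    for i in "${!codes[@]}"; do
        if [ "${codes[$i]}" -ne 0 ]; then
            # In bash, 141 = 128 + 13: the stage was killed by SIGPIPE,
            # usually because a later stage in the pipe exited first.
            echo "${names[$i]:-stage $i}: exit ${codes[$i]}"
            failed=1
        fi
    done
    if [ "$failed" -eq 0 ]; then
        echo "all stages OK"
    fi
    return "$failed"
}
```

PIPESTATUS is overwritten by the next command, so it has to be expanded on the very next line after the pipeline, as in the commented usage.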


Notes:

  • Using wget and piping to pigz for integrity testing (pigz -t) alone didn't fail.
  • Copying the image to another partition on the HDD, then cat'ing it and piping to pigz and on to ntfsclone, didn't show a problem either.
  • It only manifests when chaining all three together, which presumably loads the 'system' more?
  • Another Intel DQ35JOE with exactly the same spec, doing exactly the same thing, passed the above test!
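The isolation runs in the notes above can be scripted with a small loop that counts failures per variant. A minimal sketch, assuming bash: stress_loop is a made-up helper name, and /path/to/dq35.gz stands in for wherever the local copy of the image was put. The real pipelines are left as comments because the ntfsclone ones overwrite /dev/sda1.

```shell
#!/usr/bin/env bash
# Sketch only: stress_loop is a hypothetical helper name.
# stress_loop RUNS 'PIPELINE': run the pipeline string RUNS times in a
# child bash with pipefail enabled, so a failure in any stage counts.
stress_loop() {
    local runs=$1 cmd=$2
    local i rc failures=0
    for i in $(seq 1 "$runs"); do
        rc=0
        bash -o pipefail -c "$cmd" >/dev/null || rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "run $i: PASS"
        else
            echo "run $i: FAIL (exit $rc)"
            failures=$((failures + 1))
        fi
    done
    echo "failures: $failures/$runs"
    [ "$failures" -eq 0 ]   # non-zero return if any run failed
}

# Variants from the notes (commented out, the last two are destructive;
# /path/to/dq35.gz is a placeholder for the local copy):
# stress_loop 10 'wget -qO- http://xxx.xxx.xx.xx/cyclone/img/dq35.gz | pigz -t'
# stress_loop 10 'cat /path/to/dq35.gz | pigz -d - | ntfsclone -rO /dev/sda1 -'
# stress_loop 10 'wget -qO- http://xxx.xxx.xx.xx/cyclone/img/dq35.gz | pigz -d - | ntfsclone -rO /dev/sda1 -'
```

Running each variant the same number of times makes the failure counts roughly comparable, which matters when the fault only shows up somewhere between every 2nd and every 8th run.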

Tried:

  • Fine running 2x 512 MB
  • Fine running either 1 GB module alone
  • Failed with all modules in
  • Failed with 2x 1 GB

Also tried:

  • Same problem after swapping the 2x 1 GB modules for another 2x 1 GB modules (all Kingston)
  • Same problem after changing the HDD and PSU
  • Same problem after swapping channels around (1 GB modules moved from both blue slots to both black slots)

Therefore:

  • OK: dual channel, single sided
  • OK: single channel, double sided
  • Fail: dual channel, double sided


What's the difference? Well, the 1 GB modules are double sided, so one side uses a different connection to the memory controller. Dual channel needs tight matching of the way each DIMM runs (sorry, I don't know the technicalities of this).

I guess this obscure fault is therefore triggered by running double-sided memory in dual channel mode, with the fault itself being in the memory controller. So I am getting the board RMA'd now.

Incidentally, I had a similar issue before (a RAM problem that time), and looping through extracting/integrity testing was the only way to bring it up.
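That kind of extract-and-check loop might look something like the sketch below. It is an assumption about the general approach, not the exact commands used at the time: hash_loop is a made-up name, and it uses gzip -dc with sha256sum (pigz -dc can be swapped in for gzip -dc).

```shell
#!/usr/bin/env bash
# Sketch only: hash_loop is a hypothetical helper name.
# hash_loop FILE.gz RUNS: decompress FILE.gz RUNS times and compare the
# SHA-256 of each output with that of the first run. On healthy hardware
# every run must give the same hash.
hash_loop() {
    local file=$1 runs=$2
    local i sum first="" mismatches=0
    for i in $(seq 1 "$runs"); do
        sum=$(gzip -dc "$file" | sha256sum | cut -d' ' -f1)
        if [ -z "$first" ]; then
            first=$sum          # first run becomes the reference hash
        fi
        if [ "$sum" != "$first" ]; then
            echo "run $i: MISMATCH $sum"
            mismatches=$((mismatches + 1))
        fi
    done
    echo "mismatches: $mismatches/$runs"
    [ "$mismatches" -eq 0 ]     # non-zero return if any run differed
}

# Example (placeholder path): hash_loop /path/to/image.gz 30
```

Since the reference hash comes from the first run, a bad first run makes every later run look wrong; comparing against a hash taken on a known-good machine would be more robust.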


  • Conclusion - Board replaced with an identical one, which just passed 30 loops with no problem :)