Errors: NANs occured in hbonding!

Message boards : Number crunching : Errors: NANs occured in hbonding!

Dougga

Joined: 27 Nov 06
Posts: 28
Credit: 5,248,050
RAC: 0
Message 53476 - Posted: 31 May 2008, 6:44:08 UTC

I'm getting a strange error on one of my machines.
You can take a look at the details of my Quad Core machine for the nitty gritty.

What does this mean? I can't find any reference to this error using a search on this site.

Any help would be appreciated.

Thanks.

~Doug
ID: 53476 · Rating: 0
Ingleside

Joined: 25 Sep 05
Posts: 107
Credit: 1,514,472
RAC: 0
Message 53481 - Posted: 31 May 2008, 14:26:00 UTC - in response to Message 53476.  

NaN stands for "Not a Number", and it results from operations that have no defined numeric result, such as dividing zero by zero or taking the square root of a negative number.
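To make that concrete, here is a minimal Python sketch of how a NaN appears and then spreads through later arithmetic. It is only an illustration and is not taken from the Rosetta code; the helper name `has_nan` is made up for this example.

```python
import math

inf = float("inf")

# Operations with no well-defined numeric result produce NaN under IEEE 754.
a = inf - inf          # infinity minus infinity
b = inf * 0.0          # infinity times zero
# Note: plain Python raises exceptions for 0.0 / 0.0 and math.sqrt(-1.0)
# instead of returning NaN; C or Fortran code would typically return NaN there.

# NaN propagates: any arithmetic involving NaN yields NaN again.
total = sum([1.5, a, 2.5])

# NaN is the only floating-point value that is not equal to itself.
nan_not_equal_to_itself = (a != a)

def has_nan(values):
    """Hypothetical sanity check: True if any value in the list is NaN."""
    return any(math.isnan(v) for v in values)

print(math.isnan(a), math.isnan(b), math.isnan(total))
print(has_nan([0.1, 0.2, total]))
```

Because one bad value contaminates every result computed from it, a single corrupted intermediate number (for example from unstable hardware) is enough to trip a NaN check much later in a long calculation.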

A quick look shows that you're getting this error on most WUs. Other users don't seem to have a problem finishing them, so it doesn't look like a problem with the WU parameters. There are some "bad" WUs, but these terminate after a couple of seconds, not after 1h+ like most of your errors.
Also, it happens with both the "Mini" and the "beta" application, so another Mini bug seems less likely. Since you're running Linux, it's possibly a Linux bug, but neither of your 2 other Linux computers seems to have a problem, and at least one other Linux computer has finished a WU that your quad errored out on.

So, this can indicate a hardware problem with the quad. Check for dust bunnies, and check your system temperatures to see if anything is overheating. If you're overclocking, decrease the overclock, since your computer is generating garbage.

To check for CPU errors, run Gromacs StressCPU or the Prime95 torture test, and run a memory test to check for memory errors. Any error reported by either of these means an unstable computer that generates wrong results. You'll most likely not get an error immediately, so it's recommended to run each test for at least 24 hours.
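The general idea behind such stress tests, as far as I understand it, is to repeat a deterministic calculation and verify that the answer never changes. The Python sketch below shows that idea only; it is not how StressCPU or Prime95 are implemented, it is far too light to replace them, and the names `workload` and `consistent` are hypothetical.

```python
import math

def workload(n=20000):
    """Deterministic floating-point workload (hypothetical, for illustration)."""
    acc = 0.0
    for i in range(1, n + 1):
        acc += math.sqrt(i) * math.sin(i) / (1.0 + math.log(i))
    return acc

def consistent(rounds=5):
    """Repeat the workload; identical input must give bit-identical output."""
    reference = workload()
    if math.isnan(reference):
        return False
    return all(workload() == reference for _ in range(rounds))

print("results consistent:", consistent())
```

On healthy hardware this should always report consistent results. A real torture test does the same kind of comparison under much heavier load and for many hours, which is why marginal overclocks or overheating only show up there.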

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 53481 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org