Multiple Computation Errors

Message boards : Number crunching : Multiple Computation Errors

Mark
Joined: 1 Dec 12
Posts: 10
Credit: 20,184
RAC: 0
Message 74865 - Posted: 9 Jan 2013, 0:24:01 UTC
Last modified: 9 Jan 2013, 0:39:11 UTC

I noticed that my pile of work all errored out. Apparently once it started, each task I had ended in error until all of the units were gone. I found:
1/6/2013 8:23:09 PM | rosetta@home | [error] Signature verification failed for minirosetta_graphics_3.43_windows_x86_64.exe
in the message log.

I ran a backup with Acronis True Image Home to an external drive, and it might have been running at that time.

Could the backup cause the errors?

I've never had this problem with SETI@home.

Thanks,
Mark

Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 74869 - Posted: 9 Jan 2013, 14:50:55 UTC - in response to Message 74865.  

Looks to me like the screensaver/graphics file was corrupted. Doing a project reset should download another copy.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 74871 - Posted: 9 Jan 2013, 17:54:19 UTC

Right, a project reset will fetch a fresh copy of what appears to be a corrupted file. However, the root cause of the corruption might be a firewall or anti-virus application. You may need to whitelist BOINC and the Rosetta application.
Rosetta Moderator: Mod.Sense
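
For anyone who wants to check whether something on the machine (anti-virus, a backup tool) is altering a downloaded file, here is a minimal Python sketch. It is not BOINC's own signature verification, which the client does internally. It only records a SHA-256 digest of a file and reports later whether the file still matches. The function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large executables do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_changed(path: Path, known_digest: str) -> bool:
    """True if the file no longer matches a digest recorded earlier."""
    return file_digest(path) != known_digest
```

Record the digest right after the project reset, then compare again after a backup or virus scan has run. A changed digest suggests something other than BOINC touched the file.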

Mark
Joined: 1 Dec 12
Posts: 10
Credit: 20,184
RAC: 0
Message 74874 - Posted: 10 Jan 2013, 4:16:01 UTC

I reset the project. Nothing has downloaded yet because no work is coming in. For some reason my client only asks for ATI work and only rarely for CPU work. We'll see eventually whether the reset worked. Thanks for the quick response!

Mark
Joined: 1 Dec 12
Posts: 10
Credit: 20,184
RAC: 0
Message 74929 - Posted: 18 Jan 2013, 20:13:40 UTC

I've received and successfully crunched several units. Looks like the reset worked. Thanks.

BONNSaR
Joined: 3 Nov 05
Posts: 3
Credit: 8,983,633
RAC: 0
Message 75314 - Posted: 4 Apr 2013, 6:11:02 UTC

I have a different cause for all tasks ending in Computation Error. I'm running a Windows 7 64-bit PC with an overclocked i5-3570K. If I overclock to 4.5GHz or higher, all tasks end in Computation Error, even though at this overclock the PC is otherwise stable and runs other applications without apparent errors.

If I throttle back to 4.4GHz, the Rosetta tasks run normally without errors.

Any ideas about the Computation Errors at the 4.5GHz overclock?

Stephen Miller
Joined: 18 Sep 05
Posts: 13
Credit: 16,294,215
RAC: 0
Message 75315 - Posted: 4 Apr 2013, 7:17:10 UTC - in response to Message 75314.  

Try running the Prime95 torture test from this website: http://www.mersenne.org/freesoft/

I remember reading on their website that an overclocked computer can appear stable yet output garbage for scientific work. Hence the torture test. If you can't get Prime95 to run flawlessly for hours/days, your system is unstable.

Once the torture test runs without errors for hours/days, you have a stable system.

In my testing, running successfully for 4+ hours is a leading indicator of stability. Longer is better, especially if your ambient temperatures vary over time; that is, runs while room is cool, fails when room is hot.
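
To illustrate the idea behind this kind of test, here is a rough Python sketch: run the same deterministic calculation several times and check that every run gives an identical result. This is not how Prime95 works internally and it is far too light to replace it. The workload and function names are invented for illustration only.

```python
import hashlib
import math


def workload_digest(iterations: int = 200_000) -> str:
    """Run a fixed floating-point workload and hash its final value."""
    total = 0.0
    for i in range(1, iterations + 1):
        total += math.sin(i) * math.sqrt(i)
    # repr() keeps every digit, so a single flipped bit changes the digest.
    return hashlib.sha256(repr(total).encode("ascii")).hexdigest()


def count_mismatches(runs: int = 5, iterations: int = 200_000) -> int:
    """Repeat the workload and count runs that differ from the first run."""
    reference = workload_digest(iterations)
    return sum(
        workload_digest(iterations) != reference for _ in range(runs - 1)
    )
```

On sound hardware `count_mismatches()` should return 0 every time. Any other value means the same calculation produced different answers, which is the kind of silent error that ruins scientific work.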

BONNSaR
Joined: 3 Nov 05
Posts: 3
Credit: 8,983,633
RAC: 0
Message 75327 - Posted: 7 Apr 2013, 7:26:23 UTC - in response to Message 75315.  

Hi Stephen

Thanks for the advice. As expected, Prime95 did fail at 4.5GHz. With a few adjustments I've managed to get the PC rock stable on Prime95 at 4.5GHz, but Rosetta still shows Computation Error on all tasks at this speed. I have gone back to 4.4GHz and Rosetta runs OK. Not to worry: 4.4GHz is a significant gain in RAC over the stock, non-overclocked speed, so I'm happy with this.

dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,821,902
RAC: 15,180
Message 75341 - Posted: 9 Apr 2013, 14:36:12 UTC

Which Prime95 tests were you running? If you were running the "small" or "large" FFT tests, it might be that the memory (including L3 cache, I think) is struggling at 4.5GHz, because Rosetta taxes that quite heavily but Prime95 doesn't in the first two. Blend is probably more appropriate (although of course it should be stable on all of them!).

HTH
Danny

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,208,737
RAC: 3,249
Message 75354 - Posted: 12 Apr 2013, 12:38:02 UTC - in response to Message 74871.  

I too am getting LOTS of 'computation errors' on several machines that previously had no problems at all. I have moved a couple of machines elsewhere; the rest are only getting one or two errors, not whole batches of them.

Abriata
Joined: 19 Jan 13
Posts: 2
Credit: 953,454
RAC: 0
Message 75358 - Posted: 12 Apr 2013, 20:41:56 UTC - in response to Message 75354.  

I am getting tons of errors recently too, on all my computers! What's more, when I check the work unit I see that the same task also ended with a computation error on the other computer where it was processed. What's going on? I'm losing CPU time AND credits.

Kenneth DePrizio
Joined: 15 Jul 07
Posts: 15
Credit: 3,123,915
RAC: 0
Message 75361 - Posted: 12 Apr 2013, 23:36:37 UTC

There appear to be a bunch of bad workunits in the queue. All these "cryo" units for example are erroring out. Just gotta wait for them to cycle through.

288VKYUjwsXfAaTXn6SFJC4LVPRf
Joined: 16 Dec 05
Posts: 31
Credit: 153,110
RAC: 0
Message 75363 - Posted: 13 Apr 2013, 6:26:16 UTC

Same computation errors here. My computer almost freezes because of these WUs. Something is seriously wrong there.

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,208,737
RAC: 3,249
Message 75366 - Posted: 13 Apr 2013, 11:55:21 UTC - in response to Message 75361.  

Mine that start with 'E6' are the ones erroring out. Are those the 'cryo' ones?

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,208,737
RAC: 3,249
Message 75368 - Posted: 13 Apr 2013, 15:36:59 UTC - in response to Message 75366.  

Never mind, I found that the units starting with "cryo" are the ones having MAJOR problems for me too. I have gone through and aborted all of them!

Col323
Joined: 12 Apr 13
Posts: 2
Credit: 1,213,458
RAC: 0
Message 75381 - Posted: 16 Apr 2013, 20:22:40 UTC

I just joined and found my machine giving a lot of errors. At first I was worried about the machine, then I found this thread. When checking the logs of the units which give a compute error, they are the cryo units and they always contain the lines:

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory


I watched one run, and while most Rosetta units use 300-500MB, this one hit 1.7GB before crashing. This is on a box with 7 GB for 4 cores. I'm trying Rosetta now on a box with 28GB and 4 cores to see if they complete. Of course, I probably won't get any cryo units sent to this machine.
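
For anyone who wants to sort failed tasks by cause, here is a small Python sketch that checks saved task output for the exception record quoted above. It assumes you have copied the task's stderr text into a string. The function name is hypothetical, and the only pattern it looks for is the "Reason: Out Of Memory" line from this post.

```python
def is_out_of_memory(stderr_text: str) -> bool:
    """True if the task output contains the out-of-memory exception record."""
    for line in stderr_text.splitlines():
        # Compare case-insensitively and ignore surrounding whitespace.
        if line.strip().lower() == "reason: out of memory":
            return True
    return False
```

Feeding it the output of several failed tasks shows quickly whether they all died the same way or whether some failed for another reason.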

Col323
Joined: 12 Apr 13
Posts: 2
Credit: 1,213,458
RAC: 0
Message 75382 - Posted: 17 Apr 2013, 2:00:08 UTC

I did not get to watch, but the 28GB machine choked on a cryo unit as well. Same "out of memory" message.

I guess I'll just let them keep bombing away and abort them if I see them.

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,208,737
RAC: 3,249
Message 75383 - Posted: 17 Apr 2013, 10:49:09 UTC - in response to Message 75382.  

I am still aborting mine as I see them!

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,208,737
RAC: 3,249
Message 75384 - Posted: 17 Apr 2013, 13:25:10 UTC - in response to Message 75383.  

Does anyone know how to STOP getting the cryo tasks? I abort 5 and they send me 5 right back again! I do NOT want to stop running Rosetta but do NOT want to be wasting my time either!! So far I have no problems running the other types, but EVERY cryo unit fails!!!

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 75385 - Posted: 17 Apr 2013, 16:44:17 UTC

One idea would be to set a large cache of work, pull down a pile of tasks, and then remove the ones you don't want. Then you'll at least have a longer stretch where you can run without feeling the need to check task names again.
Rosetta Moderator: Mod.Sense
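
The "remove the ones you don't want" step could be sketched in Python roughly like this. The task names in the usage note are made up, and the project URL is a placeholder you would replace with the one your client shows. The `boinccmd --task <url> <task_name> abort` form is an assumption here, so check it against `boinccmd --help` on your own install. The code only builds the commands and does not run anything.

```python
from typing import Iterable, List


def select_tasks(task_names: Iterable[str], prefix: str = "cryo") -> List[str]:
    """Return the task names that start with the given prefix (case-insensitive)."""
    wanted = prefix.lower()
    return [name for name in task_names if name.lower().startswith(wanted)]


def abort_commands(task_names: Iterable[str], project_url: str) -> List[List[str]]:
    """Build one argument list per task; nothing is executed here."""
    # Assumed command form: boinccmd --task <project_url> <task_name> abort
    return [
        ["boinccmd", "--task", project_url, name, "abort"]
        for name in task_names
    ]
```

For example, `select_tasks(["cryo_example_0001", "E6_example_0002"])` keeps only the first name. Review the generated commands by eye before running any of them.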



©2024 University of Washington
https://www.bakerlab.org