Majority of WUs fail (and can't install v.6.4.5 as a service)

Message boards : Number crunching : Majority of WUs fail (and can't install v.6.4.5 as a service)

ads

Joined: 27 Jan 09
Posts: 7
Credit: 1,745,369
RAC: 0
Message 60251 - Posted: 21 Mar 2009, 10:26:36 UTC

So most of my WUs fail, output files being absent, and I don't really know why.

What I've gathered from the forums so far is that there might be a hardware problem. I believe so because I have had random restarts and "serious error" messages from Windows, though infrequently, ever since I assembled the machine. I've been running Folding@home for two years without problems, though.

Temperatures are what I consider normal: ~52C on an Intel Core 2 Duo 6300 @ 1.86GHz, which is about 10C lower than when I ran F@H. It is not overclocked.

Apart from this I think I'm clueless. I consider reassembling it as a last resort.

---
I also can't install the current latest BOINC client (v.6.4.5) as a service. I can do so with v.6.2.19.

As I understand it, the way to do this, and the way that works with the earlier version, is to enable protected application execution mode during install.

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,179,826
RAC: 3,209
Message 60270 - Posted: 22 Mar 2009, 10:09:42 UTC - in response to Message 60251.  

NO, do NOT enable the protected application instruction mode; uncheck it when you install. I have done this on several machines and it works fine.

You said you are having intermittent problems with the PC... have you run MS's memory tester yet? It is not fast and you should shut down BOINC while you run it, but it might point to a problem area. Intermittent problems are the absolute worst to figure out: they happen when no one is there, but the machine works fine when you are trying to figure it out!! Have you done all the MS updates since you built it? I am guessing you are also running an anti-virus and anti-spyware program too. That would rule out outside influences. 52C is pretty cool; my laptop here is running at 70C, and that is on top of a cooler!
Version 6.4.5 still has some bugs in it, so you might be better off with 6.2.19! You are the Administrator on the PC, right?
ads

Joined: 27 Jan 09
Posts: 7
Credit: 1,745,369
RAC: 0
Message 60288 - Posted: 23 Mar 2009, 18:02:26 UTC
Last modified: 23 Mar 2009, 18:02:47 UTC

I ran the memory test today for about 10h with no errors detected, so it doesn't seem to be a memory problem. Yes, I'm the administrator, and I have all MS updates installed as well as anti-virus and anti-spyware software.

It didn't work too well on v.6.2.19 either, I'm afraid.
Perhaps the best way to solve my problem is to scan for other hardware issues? If so, is there any 'preferred' software to use for that?



NO, do NOT enable the protected application instruction mode; uncheck it when you install. I have done this on several machines and it works fine.


Are we talking about the same thing here? Anyway, problem "solved". I installed v6.4.7 and can now run BOINC as a service. I couldn't do so when I didn't choose protected application execution mode during the install.
Paul

Joined: 29 Oct 05
Posts: 193
Credit: 66,422,490
RAC: 9,627
Message 60292 - Posted: 24 Mar 2009, 2:22:33 UTC - in response to Message 60288.  
Last modified: 24 Mar 2009, 2:22:52 UTC

Rosetta puts a heavy load on the system as the CPU is at 100% and lots of RAM is in constant access.

Try limiting the CPU to 50% or 60% and see if everything is OK. If so, think about the power supply. You may only have the problem at full load.

Paul
Thx!

Paul

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,179,826
RAC: 3,209
Message 60295 - Posted: 24 Mar 2009, 9:48:20 UTC - in response to Message 60292.  

Paul has a good point here, but try a slightly different tack: limit your PC to one CPU core instead of two and see if that fixes the problem. If so, Paul's idea could be a valid one.
ads

Joined: 27 Jan 09
Posts: 7
Credit: 1,745,369
RAC: 0
Message 60321 - Posted: 25 Mar 2009, 18:19:53 UTC
Last modified: 25 Mar 2009, 18:30:35 UTC

I have now tried limiting to one core for ~20 hours, and it doesn't seem to give any improvement. The PSU can deliver 520W, and the computer has worked well during periods of heavier load (i.e. gaming, and running F@H's first GPU client, though that one was never particularly stable for me either).

If no one's got a better suggestion, I'll now proceed by limiting BOINC's memory usage to 50% for one day and, if that doesn't change anything, also re-limit the CPU usage to one core for another day, and see if anything happens.


It might be of interest that the computer has had the issues described earlier even during periods when I haven't been running any computations.
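
For anyone following along: the limits discussed in this thread (half the cores, a memory cap) can usually be set in the BOINC Manager's preferences, but they can also be written to a local override file. Below is a minimal Python sketch that builds such a file's contents. The tag names (max_ncpus_pct, cpu_usage_limit, ram_max_used_busy_pct, ram_max_used_idle_pct) and the file name global_prefs_override.xml are assumptions based on how BOINC preferences are commonly documented, and may differ between client versions, so check them against the documentation for your client. The function name build_prefs_override is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(max_ncpus_pct, cpu_usage_limit, ram_busy_pct, ram_idle_pct):
    """Return XML text for a BOINC-style preferences override.

    All tag names below are assumptions; verify them for your client version.
    """
    root = ET.Element("global_preferences")
    settings = (
        ("max_ncpus_pct", max_ncpus_pct),         # share of cores to use
        ("cpu_usage_limit", cpu_usage_limit),     # share of CPU time to use
        ("ram_max_used_busy_pct", ram_busy_pct),  # memory cap while PC is in use
        ("ram_max_used_idle_pct", ram_idle_pct),  # memory cap while PC is idle
    )
    for tag, value in settings:
        ET.SubElement(root, tag).text = f"{float(value):.6f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Half the cores, full CPU time on those cores, memory capped at 50%.
    print(build_prefs_override(50, 100, 50, 50))
```

On a dual-core machine, a value of 50 for max_ncpus_pct should correspond to one core. The generated text would go into the BOINC data directory, whose location depends on the installation.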
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,179,826
RAC: 3,209
Message 60325 - Posted: 26 Mar 2009, 10:36:33 UTC - in response to Message 60321.  

You could also look and see if there is a BIOS upgrade available for your motherboard. We are getting down, I think, to the machine being the problem, not the software. Oh... do you crunch while you are gaming? Try clicking the snooze button in BOINC when you are gaming and see if that helps. The snooze button is there when you right-click the BOINC icon by the clock. It will stop the crunching for 2 hours and then automatically resume. If you game longer than that, you might try just shutting down BOINC and restarting it when you are done gaming. Ideally BOINC and gaming should not interfere with each other, but this is the real world, and it gives another possibility.
dcdc

Joined: 3 Nov 05
Posts: 1832
Credit: 119,675,695
RAC: 11,002
Message 60326 - Posted: 26 Mar 2009, 12:48:51 UTC

try prime95 for a couple of hrs using the second option (max cpu, some RAM)...
ads

Joined: 27 Jan 09
Posts: 7
Credit: 1,745,369
RAC: 0
Message 60357 - Posted: 28 Mar 2009, 12:08:56 UTC
Last modified: 28 Mar 2009, 12:09:44 UTC

I rarely game, so it can't reasonably be related to that. I flashed the BIOS, but that didn't seem to be of any benefit.


p95 gave something though. I've run it twice (max CPU option) for ~1.5h with the results:

run 1:

core#0 failed @1h09m
core#1 still running @1h46m

run 2:

core#0 still running @1h21m
core#1 failed @50m - Fatal Error: Rounding was .4990234375, expected less than .4

I'm not sure what to make of this but will try to do some research. I'd appreciate input in this thread though. I'm planning on running it again on Monday for a longer period. Temperatures go down when one of the cores fails, so I guess a temperature problem is more likely if only one of the cores has failed after, say, 8h. Of course it may be related to another factor; I just can't think of a likely one right now (assuming I can consider power supply and memory problems less likely. Can I?).
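
A note on the "Rounding was .4990234375, expected less than .4" message: as commonly described, Prime95 multiplies very large numbers using floating-point FFTs, so every output value should land almost exactly on an integer. The program tracks how far the values stray from the nearest integer and stops when that distance reaches the limit quoted in the message. The toy Python sketch below only illustrates that kind of check. It is not Prime95's code, and the sample values and function names are made up.

```python
ROUNDOFF_LIMIT = 0.4  # threshold quoted in the error message above


def max_roundoff(values):
    """Largest distance between any value and its nearest integer (never above 0.5)."""
    return max(abs(v - round(v)) for v in values)


def check_roundoff(values, limit=ROUNDOFF_LIMIT):
    """Return (error, ok), where ok is True when the error is below the limit."""
    err = max_roundoff(values)
    return err, err < limit


if __name__ == "__main__":
    healthy = [12.0000001, 7.9999998, 41.0]          # made-up values
    suspect = [12.0000001, 7.9999998, 3.4990234375]  # made-up values
    print(check_roundoff(healthy))
    print(check_roundoff(suspect))
```

An error that close to 0.5 means a result could no longer be told apart from the neighbouring integer. On a machine that is not overclocked, that is usually read as a sign that a calculation went wrong somewhere in the hardware, though the message alone does not say where.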

---
I've also had a couple of random restarts; at least (2) occurred after the BIOS flash. I saved this:

(1)
BCCode : f7 BCP1 : 00000004 BCP2 : 000091AC BCP3 : FFFF6E53
BCP4 : 00000000 OSVer : 5_1_2600 SP : 3_0 Product : 256_1

(2)
BCCode : 1000008e BCP1 : C0000005 BCP2 : 805A3A9F BCP3 : AA77E9F8
BCP4 : 00000000 OSVer : 5_1_2600 SP : 3_0 Product : 256_1

Unless anyone here can "see" the problem and share it here, please direct me to where I can find info on how to do so.
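
On reading the two reports above: BCCode is the Windows bug check (stop) code in hexadecimal and BCP1 to BCP4 are its parameters. The Python sketch below parses the two saved reports and looks the codes up in a deliberately tiny table. The names and parameter meanings are as recalled from Microsoft's bug check code reference and should be verified there; the table covers only these two codes, and the function name decode_report is made up for illustration.

```python
import re

# Tiny lookup tables covering only the codes in the two reports above.
# Names are recalled from Microsoft's bug check reference; verify them there.
BUGCHECK_NAMES = {
    0xF7: "DRIVER_OVERRAN_STACK_BUFFER",
    0x1000008E: "KERNEL_MODE_EXCEPTION_NOT_HANDLED_M",
}
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}

REPORT_1 = """BCCode : f7 BCP1 : 00000004 BCP2 : 000091AC BCP3 : FFFF6E53
BCP4 : 00000000 OSVer : 5_1_2600 SP : 3_0 Product : 256_1"""

REPORT_2 = """BCCode : 1000008e BCP1 : C0000005 BCP2 : 805A3A9F BCP3 : AA77E9F8
BCP4 : 00000000 OSVer : 5_1_2600 SP : 3_0 Product : 256_1"""


def decode_report(text):
    """Return (bug check name, list of notes) for one saved error report."""
    fields = dict(re.findall(r"(\w+)\s*:\s*(\w+)", text))
    code = int(fields["BCCode"], 16)
    params = [int(fields[f"BCP{i}"], 16) for i in range(1, 5)]
    name = BUGCHECK_NAMES.get(code, "not in this table")
    notes = []
    if code == 0xF7:
        # BCP1 = cookie found on the stack, BCP2 = expected cookie,
        # BCP3 = bitwise complement of the expected cookie.
        if params[0] != params[1]:
            notes.append("stack cookie does not match the expected value")
        if params[2] == (~params[1] & 0xFFFFFFFF):
            notes.append("BCP3 is the complement of BCP2, as expected")
    elif code == 0x1000008E:
        # BCP1 = exception code, BCP2 = address where the exception occurred.
        exc = NTSTATUS_NAMES.get(params[0], hex(params[0]))
        notes.append(f"unhandled exception {exc} at address {params[1]:#010x}")
    return name, notes


if __name__ == "__main__":
    for report in (REPORT_1, REPORT_2):
        name, notes = decode_report(report)
        print(name)
        for note in notes:
            print("  -", note)
```

Neither code names a faulty part by itself. Both are the kind of error that corrupted memory contents can produce, and that corruption could come from hardware or from a driver.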
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,179,826
RAC: 3,209
Message 60358 - Posted: 28 Mar 2009, 12:33:22 UTC - in response to Message 60357.  

No idea exactly, but try removing and reinstalling the CPU, putting some new stuff underneath at the same time. I know this is a pain and is not most people's idea of a fun way to spend their Saturday, Sunday or whatever day. But with each core giving you trouble on different runs, it seems logical that there is something only slightly wrong someplace.
Paul

Joined: 29 Oct 05
Posts: 193
Credit: 66,422,490
RAC: 9,627
Message 60363 - Posted: 28 Mar 2009, 14:56:08 UTC - in response to Message 60357.  

This is a weird problem. It looks like you are forced to go back to troubleshooting 101 and start testing parts.

You can start with the least expensive parts first or find a friend who has some compatible parts and start swapping. You can also look at the probabilities and start there.

System boards are typically reliable but they have the most components. It is hard to start with the system board because everything needs to come out so you might want to start with something a little easier.

Swap the power supply. High probability of failure, relatively low cost and easy to swap. Based on what you have told us so far, this is unlikely to be the problem. Repeat the Prime95 test. It looks like you will know in about 2 hours.

RAM is easy. Swap it with new or known working RAM from a friend. Repeat the test.

Your system board is next on the list. While system boards are usually reliable, they have lots of components and even 1 bad capacitor could be the issue. Repeat the test. BTW, this is my current best guess.

The CPU could be bad. If you have a friend with a Core2 rig, just swap the processors. Prime95 for 2 hours....

You are going to have to do it one at a time. Best of luck.

BTW - I would give Prime95 4 hours on some of these tests. If it is heat related you need to give everything time to get nice and hot.

Best of luck!


Thx!

Paul

ads

Joined: 27 Jan 09
Posts: 7
Credit: 1,745,369
RAC: 0
Message 60699 - Posted: 17 Apr 2009, 18:12:28 UTC

An update:


R@H improved a lot for many days after I had disassembled and reassembled the computer; however, this effect has gradually faded away for some obscure reason. Unfortunately I don't have any parts that can be used for troubleshooting.

I've updated BOINC to v.6.6.20 without any effect that I've noticed. I also have several new error messages from Windows if anyone's interested...

I'm thinking of "going back" to F@H, but I'm naturally wondering whether I'll do more harm than good since my system is obviously unstable. The safest approach would be not to, which is what I'm leaning towards, but other thoughts are welcome.
ads

Joined: 27 Jan 09
Posts: 7
Credit: 1,745,369
RAC: 0
Message 61232 - Posted: 17 May 2009, 9:59:24 UTC

Another update:

Since minirosetta 1.67, errors have been nearly non-existent. My computer now runs perfectly. I'm concerned, though, that I might actually be returning failed WUs. Is this possible, or likely?
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 61233 - Posted: 17 May 2009, 10:10:28 UTC - in response to Message 61232.  

My computer now runs perfectly. I'm concerned though that I might actually be returning failed wu's. Is this possible, or likely?


Your Intel(R) Core(TM)2 CPU 6300 computer had a few errors on 11th and 12th May but has since then been returning successful results. Your other two computers seem to have been free of problems as all recent results are listed as valid.
ads

Joined: 27 Jan 09
Posts: 7
Credit: 1,745,369
RAC: 0
Message 61236 - Posted: 17 May 2009, 14:42:56 UTC

Your Intel(R) Core(TM)2 CPU 6300 computer had a few errors on 11th and 12th May but has since then been returning successful results


Yes, that's what I meant by near perfectly, which might've been an exaggeration.
What I wondered was whether Rosetta could fail to recognize failed work units, since the computer in question has been more or less unstable for a long time. If so, it might be better not to use it for crunching.

©2024 University of Washington
https://www.bakerlab.org