Message boards : Number crunching : Majority of WUs fail (and can't install v.6.4.5 as a service)
Author | Message |
---|---|
ads Joined: 27 Jan 09 Posts: 7 Credit: 1,745,369 RAC: 0 |
So most of my WUs fail, with output files being absent, and I don't really know why. What I've gathered from the forums so far is that there might be a hardware problem. I believe so because I have had random restarts and received "serious error" messages from Windows, though infrequently, ever since I assembled the machine. I've been running Folding@home for two years without problems, though. Temperatures are what I consider normal, ~52C on an Intel Core2duo 6300 @1.86GHz, which is about 10C less than when I ran F@H. Not OC'd. Apart from this I think I'm clueless. I consider reassembling it as a last resort. --- I also can't install the currently latest BOINC client software as a service. I can do so with v.6.2.19. The way I believe it should work, and the way that does work with the earlier version, is to enable protected application execution mode during install. |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,179,826 RAC: 3,209 |
So most of my WUs fail, output files being absent, and I don't really know why. NO, do NOT enable the protected application instruction mode, uncheck it, when you install. I have done this for several machines and it works fine. You said you are having intermittent problems with the PC... have you run MS's memory tester yet? It is not fast and you should shut down Boinc when you run it, but it might point to a problem area. Intermittent problems are the absolute worst to figure out: they happen when no one is there, but the machine works fine when you are trying to figure it out!! Have you done all the MS updates since you built it? I am guessing you are also running an anti-virus and anti-spyware program too. That would rule out outside influences. 52C is pretty cool, my laptop here is running at 70C and that is on top of a cooler! Version 6.4.5 still has some bugs in it, you might be better off with 6.2.19! You are the Administrator on the PC, right? |
ads Joined: 27 Jan 09 Posts: 7 Credit: 1,745,369 RAC: 0 |
I ran the memory test today for about 10h (with no detected errors), so it doesn't seem to be a memory problem to me. Yes, I'm the administrator, and I have all MS updates installed as well as anti-virus and anti-spyware software. It didn't work too well on v.6.2.19 either, I'm afraid. Perhaps the best way to solve my problem is to scan for other hardware issues? If so, is there any 'preferred' software to use for that? NO do NOT enable the protected application instruction mode, uncheck it, when you install. I have done this for several machines and it works fine. Are we talking about the same thing here? Anyway, problem "solved". I installed v6.4.7 and can now run BOINC as a service. I couldn't do so when I didn't choose protected application execution mode during the install. |
Paul Joined: 29 Oct 05 Posts: 193 Credit: 66,422,060 RAC: 9,629 |
Rosetta puts a heavy load on the system as the CPU is at 100% and lots of RAM is in constant access. Try limiting the CPU to 50% or 60% and see if everything is OK. If so, think about the power supply. You may only have the problem at full load. Paul Thx! Paul |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,179,826 RAC: 3,209 |
Rosetta puts a heavy load on the system as the CPU is at 100% and lots of RAM is in constant access. Paul has a good point here, but try a slightly different tack by limiting your PC to one CPU instead of 2 and see if that fixes the problem. If so, Paul's idea could be a valid one. |
ads Joined: 27 Jan 09 Posts: 7 Credit: 1,745,369 RAC: 0 |
I have now tried limiting to one core for ~20 hours, and it doesn't seem to give any improvement. The PSU can give 520W, and the computer has worked well during periods of heavier load (i.e. gaming, and running F@H's first GPU client (that one was never particularly stable for me either though)). If no one's got a better suggestion, I'll now proceed by limiting BOINC's memory usage to 50% for one day and, if that doesn't change anything, also re-limit the CPU usage to one core for another day, and see if anything happens. It might be of interest that the computer has had the issues described earlier even during periods when I haven't been running any computations. |
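For anyone who wants to apply the limits discussed here (one of two cores, 50% memory) from a file instead of the Manager dialogs, here is a minimal sketch. It assumes the client reads a `global_prefs_override.xml` file from the BOINC data directory and understands the four tag names used below; both are assumptions to check against the documentation for your client version, and the helper names `build_override` and `read_override` are made up for this example.

```python
import xml.etree.ElementTree as ET


def build_override(cpu_pct, cpu_time_pct, ram_pct):
    """Return the text of a preferences override file.

    Tag names are assumed from BOINC's preferences documentation;
    verify them for the client version you actually run.
    """
    settings = {
        # Share of processors BOINC may use (50 on a dual core = one core).
        "max_ncpus_pct": cpu_pct,
        # Share of CPU time BOINC may use on those processors.
        "cpu_usage_limit": cpu_time_pct,
        # Memory ceiling while the computer is in use, and while idle.
        "ram_max_used_busy_pct": ram_pct,
        "ram_max_used_idle_pct": ram_pct,
    }
    lines = ["<global_preferences>"]
    for tag, value in settings.items():
        lines.append(f"   <{tag}>{float(value):.1f}</{tag}>")
    lines.append("</global_preferences>")
    return "\n".join(lines) + "\n"


def read_override(text):
    """Parse the file text back into a dict, as a well-formedness check."""
    root = ET.fromstring(text)
    return {child.tag: float(child.text) for child in root}


if __name__ == "__main__":
    # One core out of two, full speed on that core, half of the RAM.
    text = build_override(cpu_pct=50, cpu_time_pct=100, ram_pct=50)
    print(text)
    print(read_override(text))
```

The output would be saved as `global_prefs_override.xml`, and the client then has to be told to re-read its preferences (or be restarted) before the limits apply.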
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,179,826 RAC: 3,209 |
I have now tried limiting to one core for ~20 hours, it doesn't seem to give any improvement. The PSU can give 520W, and the computer has worked well during periods of heavier load (i.e. gaming, and running F@H's first GPU client (that one was never particularly stable for me either though)). You could also look and see if there is a BIOS upgrade available for your mb. We are getting down, I think, to the machine being the problem, not the software. Oh... do you crunch while you are gaming? Try clicking the snooze button on Boinc when you are gaming and see if that helps. The snooze button is there when you right click the Boinc icon by the clock. It will stop the crunching for 2 hours and then automatically resume crunching. If you game longer than that you might try just shutting down Boinc and then restarting it when you are done gaming. Ideally Boinc and gaming should not interfere with each other, but this is the real world, and it gives another possibility. |
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,675,695 RAC: 11,002 |
try prime95 for a couple of hrs using the second option (max cpu, some RAM)... |
ads Joined: 27 Jan 09 Posts: 7 Credit: 1,745,369 RAC: 0 |
I rarely game, so it can't reasonably be related to that. I flashed the BIOS, but it didn't seem to be of any benefit. p95 gave something though. I've run it twice (max CPU option) for ~1.5h with the results: run 1: core#0 fail @1h9m; core#1 still running @1h46m. run 2: core#0 still running @1h21m; core#1 fail @50m - Fatal Error: Rounding was .4990234375, expected less than .4 I'm not sure what to make of this but will try to do some research. I'd appreciate input through this thread though. I'm planning on running it again on Monday for a longer period of time. Temps go down when one of the cores fails, so I guess it'll more likely be a temperature problem if only one of the cores fails after, say, 8h. Of course it may be related to another factor, I just can't think of a likely one right now (if I can consider power supply and memory problems as less likely, can I?). --- I've also had a couple of random restarts; at least (2) occurred after the BIOS flash. I saved this: (1) BCCode : f7 BCP1 : 00000004 BCP2 : 000091AC BCP3 : FFFF6E53 BCP4 : 00000000 OSVer : 5_1_2600 SP : 3_0 Product : 256_1 (2) BCCode : 1000008e BCP1 : C0000005 BCP2 : 805A3A9F BCP3 : AA77E9F8 BCP4 : 00000000 OSVer : 5_1_2600 SP : 3_0 Product : 256_1 Unless anyone here can "see" the problem and share it here, please direct me to where I can find info on how to do so. |
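The two restart reports above are Windows bug check (stop error) signatures, and the `BCCode` values can be looked up by name. Here is a small sketch that splits each report into fields and labels the code. The lookup tables cover only the two codes and the one exception status quoted in the post, named as I know them from Microsoft's bug check reference; the function names `parse_report` and `describe` are made up for this example. A name only labels the type of crash, it does not identify the faulty part.

```python
import re

# Only the codes that appear in the post; anything else stays "unknown".
BUGCHECK_NAMES = {
    0x000000F7: "DRIVER_OVERRAN_STACK_BUFFER",
    0x1000008E: "KERNEL_MODE_EXCEPTION_NOT_HANDLED_M",
}
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}

# The two reports exactly as quoted in the post.
REPORTS = [
    "BCCode : f7 BCP1 : 00000004 BCP2 : 000091AC BCP3 : FFFF6E53 "
    "BCP4 : 00000000 OSVer : 5_1_2600 SP : 3_0 Product : 256_1",
    "BCCode : 1000008e BCP1 : C0000005 BCP2 : 805A3A9F BCP3 : AA77E9F8 "
    "BCP4 : 00000000 OSVer : 5_1_2600 SP : 3_0 Product : 256_1",
]


def parse_report(text):
    """Turn 'Key : value Key : value ...' into a dict of strings."""
    return dict(re.findall(r"(\w+)\s*:\s*(\S+)", text))


def describe(report):
    """Return a one-line label for a parsed report."""
    code = int(report["BCCode"], 16)
    summary = f"0x{code:08X} {BUGCHECK_NAMES.get(code, 'unknown')}"
    # For the 0x8E family, the first parameter is the exception code
    # that went unhandled.
    if code & 0x0FFFFFFF == 0x8E:
        status = int(report["BCP1"], 16)
        summary += f", exception {NTSTATUS_NAMES.get(status, hex(status))}"
    return summary


if __name__ == "__main__":
    for text in REPORTS:
        print(describe(parse_report(text)))
```

The `OSVer : 5_1_2600` and `SP : 3_0` fields correspond to Windows XP with Service Pack 3. Two different stop codes on the same machine, together with the Prime95 rounding failures on alternating cores, fit the hardware suspicion raised earlier in the thread, but they do not prove it.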
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,179,826 RAC: 3,209 |
I rarely game, so it can't reasonably be related to that. I flashed the BIOS, didn't seem to be of any benefit. No idea exactly, but try removing and reinstalling the CPU, putting some new stuff underneath at the same time. I know this is a pain and is not most people's idea of a fun way to spend their Saturday, Sunday or whatever day. But with each core giving you trouble on different runs, it seems logical that there is something only slightly wrong someplace. |
Paul Joined: 29 Oct 05 Posts: 193 Credit: 66,422,060 RAC: 9,629 |
This is a weird problem. It looks like you are forced to go back to troubleshooting 101 and start testing parts. You can start with the least expensive parts first or find a friend who has some compatible parts and start swapping. You can also look at the probabilities and start there. System boards are typically reliable but they have the most components. It is hard to start with the system board because everything needs to come out, so you might want to start with something a little easier. Swap the power supply: high probability of failure, relatively low cost and easy to swap. Based on what you have told us so far, this is unlikely to be the problem. Repeat the Prime95 test. It looks like you will know in about 2 hours. RAM is easy. Swap it with new or known working RAM from a friend. Repeat the test. Your system board is next on the list. While system boards are usually reliable, they have lots of components and even 1 bad capacitor could be the issue. Repeat the test - BTW - this is my current best guess. The CPU could be bad. If you have a friend with a Core2 rig, just swap the processors. Prime95 for 2 hours.... You are going to have to do it one at a time. Best of luck. BTW - I would give Prime95 4 hours on some of these tests. If it is heat related you need to give everything time to get nice and hot. Best of luck! Thx! Paul |
ads Joined: 27 Jan 09 Posts: 7 Credit: 1,745,369 RAC: 0 |
An update: R@H improved greatly for many days after I had disassembled and reassembled the computer; however, this effect has gradually faded away for some obscure reason. Unfortunately I don't have any parts that can be used for troubleshooting. I've updated BOINC to v.6.6.20 without any effect that I've noticed. I also have several new error messages from Windows if anyone's interested... I'm thinking of "going back" to F@H, but I'm naturally wondering if I'll do more harm than good since my system is obviously unstable. The safest approach would be not to, and that is what I'm leaning towards, but other thoughts are welcome. |
ads Joined: 27 Jan 09 Posts: 7 Credit: 1,745,369 RAC: 0 |
Another update: Since minirosetta 1.67, errors have been nearly non-existent. My computer now runs perfectly. I'm concerned though that I might actually be returning failed WUs. Is this possible, or likely? |
Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
My computer now runs perfectly. I'm concerned though that I might actually be returning failed wu's. Is this possible, or likely? Your Intel(R) Core(TM)2 CPU 6300 computer had a few errors on 11th and 12th May but has since then been returning successful results. Your other two computers seem to have been free of problems as all recent results are listed as valid. |
ads Joined: 27 Jan 09 Posts: 7 Credit: 1,745,369 RAC: 0 |
Your Intel(R) Core(TM)2 CPU 6300 computer had a few errors on 11th and 12th May but has since then been returning successful results Yes, that's what I meant by near perfectly, which might've been an exaggeration. What I wondered was whether Rosetta could fail to recognize failed work units, since the mentioned computer has been more or less unstable for a long time. If so, it might be better not to use it for crunching. |
©2024 University of Washington
https://www.bakerlab.org