Message boards : Number crunching : Any suggestions for this failing machine?
Author | Message |
---|---|
Alan Roberts Joined: 7 Jun 06 Posts: 61 Credit: 6,901,926 RAC: 0 |
I noticed yesterday that one of my nighttime crunchers seemed to have lost its way: it was churning through WUs at a high rate and erroring them out (for the most part with 0xc0000005 errors). I ran three full passes of Memtest86+ as a quick confidence check, rebooted, and left the machine to start up at its scheduled time. This morning I see it spent the night throwing more errors. I'm also deferred from communicating for 18 hours or so. I assume that is because BOINC and/or Rosetta are trying to stop a wacky machine from draining the WU queue only to error out all the work? I've reset the project, and the BOINC\slots\... folders are empty. There are still a lot of files in BOINC\projects\boinc.bakerlab.org_rosetta; should I remove those by hand? Is a detach and reattach from the project, an uninstall/reinstall, or some other action called for? The machine is a stock full-sized Dell tower, and neither the daytime users nor the event logs show any signs of distress (beyond Rosetta). |
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 1,696 |
Any chance it could be temperature related? There's a seti-boinc thread here that links to the wiki: http://setiathome.berkeley.edu/forum_thread.php?id=31247 |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Can't help you with the problem, but I wanted to clarify: yes, each host starts with a maximum of 100 WUs per day. Every time you report a WU that fails to validate, this is reduced by one. You zipped through 100 WUs and failed on all of them, so this brings you down to a maximum of 1. You can get 1 per day. If you fail on that last one, you have to wait a day to get another. Crunch that one successfully and your maximum is doubled, and doubled again and again on subsequent successes. Yes, this is done to halt the "damage" (nothing is really damaged, I'm just using the word figuratively) done by a problem host. It frees up the bandwidth to the server for hosts that are running well. You can see on the workunit pages that other hosts were able to crunch the same work units without incident, so it doesn't seem likely to be a problem with the work you were sent. Resetting the project will reload all the Rosetta software. If I'm hearing you right, that didn't help. The project reset downloads the application files again, so the files you see in the Rosetta directory should have fresh dates on them. Do you run any other projects on the machine? Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
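For anyone who wants to see how that daily quota behaves, here is a rough Python sketch of the rule as described in the post above. It is not the actual BOINC server code. The names `update_quota` and `simulate` are made up for illustration, and capping the doubling at the original 100 is an assumption, since the post does not say where the doubling stops.

```python
DEFAULT_MAX_PER_DAY = 100  # starting quota, per the description above


def update_quota(quota, succeeded, ceiling=DEFAULT_MAX_PER_DAY):
    """Return the new daily WU quota after one reported result.

    A failure lowers the quota by one, but never below 1.
    A success doubles it. The ceiling is an assumption, not a
    confirmed server setting.
    """
    if succeeded:
        return min(quota * 2, ceiling)
    return max(quota - 1, 1)


def simulate(outcomes, start=DEFAULT_MAX_PER_DAY):
    """Apply a sequence of True/False outcomes and return the quota history."""
    history = []
    quota = start
    for ok in outcomes:
        quota = update_quota(quota, ok)
        history.append(quota)
    return history


if __name__ == "__main__":
    # A night of 100 failures, then seven good results in a row.
    history = simulate([False] * 100 + [True] * 7)
    print("after the failures:", history[99])
    print("recovery:", history[100:])
```

Under these assumptions the quota bottoms out at 1 after the failures and climbs back as 2, 4, 8, 16, 32, 64, 100 over seven successes, so a repaired host gets back to full quota fairly quickly.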
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
If you haven't already done so, I would try removing the files in BOINC\projects\boinc.bakerlab.org_rosetta and then resetting the project. A database file could have been corrupted. As a last resort, reinstall the client. EDIT: ERROR:: Exit at: .fullatom_setup.cc line:650 suggests a corrupted bbdep02.May.sortlib.gz file. But the access violations that start to occur later could be hardware related: overheating, bad memory, etc. Good luck troubleshooting. |
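Before deleting anything by hand, one way to check whether a .gz file in the project folder is actually damaged is to decompress it all the way through and see if Python's standard `gzip` module complains. This is only a sketch: the function name `find_corrupt_gz` is made up, and the folder in the usage line is a placeholder to swap for wherever BOINC is installed. A file that passes is a well-formed gzip archive, which does not by itself prove its contents are what Rosetta expects.

```python
import gzip
import zlib
from pathlib import Path


def find_corrupt_gz(folder):
    """Return (file name, error text) for each .gz file that fails to decompress."""
    bad = []
    for path in sorted(Path(folder).glob("*.gz")):
        try:
            with gzip.open(path, "rb") as fh:
                # Read to the end in 1 MiB chunks so the CRC and length
                # stored in the gzip trailer get verified.
                while fh.read(1 << 20):
                    pass
        except (OSError, EOFError, zlib.error) as exc:
            # OSError covers a bad header or CRC mismatch, EOFError a
            # truncated file, zlib.error a damaged compressed stream.
            bad.append((path.name, str(exc)))
    return bad


if __name__ == "__main__":
    # Placeholder path; point this at your own Rosetta project folder.
    for name, reason in find_corrupt_gz("path/to/rosetta/project/folder"):
        print(f"{name}: {reason}")
```

If bbdep02.May.sortlib.gz shows up in the output, that would fit the corrupted-file theory. If nothing is listed, hardware causes look more likely.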
Alan Roberts Joined: 7 Jun 06 Posts: 61 Credit: 6,901,926 RAC: 0 |
dcdc, seti seems to have their server offline, so I can't read the thread right now. The machine is a full-sized Dell tower that never even increased its fan speed when I first set it up for nighttime crunching, but I do remember that the version of SpeedFan I was using at the time didn't report temps for the box. So I suppose it's possible. Memtest86+ didn't make it fail last night, but perhaps that doesn't thermally stress things. I'll grab up-to-date SpeedFan, CPU-Z and such this afternoon, and take another pass this evening. Perhaps something like Prime95 to put a synthetic CPU load on as well. I think I can get the box open to make certain there isn't something like a failed CPU cooler fan. Feet1st, I did the reset this morning after checking and seeing more errors, but the machine isn't allowed to run Rosetta until evening hours, so I don't know if it has had any benefit yet. There were no tasks listed in BOINC when I hit Reset, and after that I noticed all the GZ files still in the projects\boinc.bakerlab.org_rosetta folder. They all have older dates, and look like data of some sort. I guess I'll find out this evening, after some further hardware testing, whether it will pull a task and run it. Thanks for the explanation of how the WU throttling works. |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Those files must be the ones David Kim was talking about. I think I overstated what I knew there. Apparently a reset of the project doesn't re-download all of these base files. They download once and are referenced by all the work units. If the project wants to update them, I'd guess it can, but otherwise they stay around for a long time and aren't part of what is reset. Wonder what the logic of that is? You'd think BOINC would just wipe out the directory and reattach to the project, thus re-pulling all such control files too. Oh well, live and learn. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
©2025 University of Washington
https://www.bakerlab.org