Any suggestions for this failing machine?

Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 36201 - Posted: 6 Feb 2007, 14:42:36 UTC

I noticed yesterday that one of my nighttime crunchers seemed to have lost its way, and was churning through WUs at a high rate, erroring them out (for the most part with 0xc0000005 errors). I ran three full passes of Memtest86+ just as a quick confidence check, rebooted, and left the machine to start up at its scheduled time.

This morning I see it spent the night throwing more errors. Also, I'm deferred from communicating for 18 hours or so. I assume that is because BOINC and/or Rosetta are trying to stop a wacky machine from draining the WU queue, only to error out all the work?

I've reset the project, and BOINC\slots\... are empty folders. There are still a lot of files in BOINC\projects\boinc.bakerlab.org_rosetta. Should I remove those by hand? Is a detach and reattach from the project, an uninstall/reinstall, or some other action called for?

The machine is a stock full-sized Dell tower, and neither the daytime users nor the event logs show any signs of distress (beyond Rosetta).
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,860,059
RAC: 1,696
Message 36204 - Posted: 6 Feb 2007, 15:30:16 UTC - in response to Message 36201.  
Last modified: 6 Feb 2007, 15:30:32 UTC

Any chance it could be temperature related? There's a seti-boinc thread here that links to the wiki: http://setiathome.berkeley.edu/forum_thread.php?id=31247

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 36205 - Posted: 6 Feb 2007, 16:19:36 UTC

Can't help you with the problem, but I wanted to clarify for you: yes, each host starts with a maximum of 100 WUs per day. Every time you report a WU that fails to validate, this is reduced by one. You zipped through 100 WUs and failed on all of them, so this brings you down to a maximum of 1. You can get 1 per day. If you fail on that last one, you have to wait a day to get another. Crunch that one successfully and your maximum is doubled, and doubled again and again on subsequent successes.

Yes, this is done to halt the "damage" (nothing is really damaged, I'm just using the word figuratively) done by a problem host. It frees up the bandwidth to the server for hosts that are running well.
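A rough sketch of the quota rule described above, in Python. This is not BOINC's actual scheduler code; the function name `simulate_daily_quota`, the floor of 1, and the cap of 100 on doubling are assumptions made for illustration.

```python
def simulate_daily_quota(results, start=100, floor=1, cap=100):
    """Return the daily WU quota after each reported result.

    results: iterable of booleans, True for a success, False for a failure.
    start, floor and cap are illustrative assumptions, not BOINC settings.
    """
    quota = start
    history = []
    for ok in results:
        if ok:
            # Doubled on each success (capping at `cap` is an assumption).
            quota = min(cap, quota * 2)
        else:
            # Reduced by one on each failure, never below `floor`.
            quota = max(floor, quota - 1)
        history.append(quota)
    return history


# 100 failures in a row, followed by three successes.
history = simulate_daily_quota([False] * 100 + [True] * 3)
print(history[99], history[100:])  # prints: 1 [2, 4, 8]
```

Under these assumptions a host that fails 100 WUs in a row ends up at 1 per day, and needs a run of successes before the quota climbs back.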

You can see on the workunit pages, that other hosts were able to crunch the same work unit without incident. So, it doesn't seem likely to be a problem with the work you were sent.

Resetting the project will reload all the Rosetta software. If I'm hearing you right, that didn't help. The project reset downloads the application files again, so the files you see in the Rosetta directory should have fresh dates on them.

Do you run any other projects on the machine?
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1480
Credit: 4,334,829
RAC: 0
Message 36215 - Posted: 6 Feb 2007, 20:15:12 UTC
Last modified: 6 Feb 2007, 21:23:22 UTC

If you haven't already done so, I would try to remove the files in BOINC\projects\boinc.bakerlab.org_rosetta and then reset the project. A database file could have been corrupted.

Last resort, reinstall the client.


EDIT:

ERROR:: Exit at: .\fullatom_setup.cc line:650 suggests a corrupted bbdep02.May.sortlib.gz file.

But the access violations that start to occur later could be caused by overheating or bad memory, i.e. something hardware related. Good luck troubleshooting.
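A minimal sketch for checking whether any .gz files in the project folder are corrupted, assuming Python is available on the machine. It only tests that each file decompresses cleanly; it cannot tell whether the contents are what Rosetta expects. The function name `find_corrupt_gz` and the folder path in the usage comment are hypothetical.

```python
import gzip
import zlib
from pathlib import Path


def find_corrupt_gz(folder):
    """Return the names of .gz files in `folder` that fail to decompress."""
    bad = []
    for path in sorted(Path(folder).glob("*.gz")):
        try:
            with gzip.open(path, "rb") as fh:
                # Read the whole stream in 1 MiB chunks so the trailing
                # CRC and length check is actually reached.
                while fh.read(1024 * 1024):
                    pass
        except (OSError, EOFError, zlib.error):
            # OSError covers bad headers and CRC mismatches, EOFError a
            # truncated file, zlib.error a damaged compressed stream.
            bad.append(path.name)
    return bad


# Usage (the path is a placeholder, adjust it to the actual install):
# print(find_corrupt_gz(r"path\to\BOINC\projects\boinc.bakerlab.org_rosetta"))
```

Any file it reports would be a candidate for deleting by hand before the project reset.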
Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 36218 - Posted: 6 Feb 2007, 20:31:55 UTC

dcdc, seti seems to have their server offline, so I can't read the thread right now.

The machine is a full-sized Dell tower that never even increased its fan speed when I first set it up for nighttime crunching, but I do remember that the version of SpeedFan I was using at the time didn't report temps for the box. So I suppose it's possible. Memtest86+ didn't make it fail last night, but perhaps that doesn't thermally stress things.

I'll grab up-to-date SpeedFan, CPU-Z and such this afternoon, and take another pass this evening. Perhaps something like Prime95 to put synthetic CPU load on as well. I think I can get the box open to make certain there isn't something like a failed CPU cooler fan.

Feet1st, I did the reset this morning after checking and seeing more errors, but the machine isn't allowed to run Rosetta until evening hours, so I don't know yet whether it has had any benefit. There were no tasks listed in BOINC when I hit Reset, and after that I noticed all the GZ files still in the projects\boinc.bakerlab.org_rosetta folder. They all have older dates, and look like data of some sort. I guess I'll find out if it will pull a task and run it this evening after some further hardware testing.

Thanks for the explanation of how the WU throttling works.
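For comparing file dates in the project folder, as discussed above, here is a small Python sketch that lists the files oldest first. The function name `list_by_mtime` and the path in the usage comment are hypothetical; files dated before the reset would be the ones the reset did not refresh.

```python
from datetime import datetime
from pathlib import Path


def list_by_mtime(folder):
    """Return (name, last-modified datetime) pairs for files, oldest first."""
    entries = [
        (p.name, datetime.fromtimestamp(p.stat().st_mtime))
        for p in Path(folder).iterdir()
        if p.is_file()
    ]
    return sorted(entries, key=lambda item: item[1])


# Usage (the path is a placeholder, adjust it to the actual install):
# for name, modified in list_by_mtime(r"path\to\BOINC\projects\boinc.bakerlab.org_rosetta"):
#     print(modified.strftime("%Y-%m-%d %H:%M"), name)
```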
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 36221 - Posted: 6 Feb 2007, 22:00:42 UTC

Those files must be the ones David Kim was talking about. I think I overstated what I knew there. Apparently a reset of the project doesn't replace all of these base files. They download once and are referenced by all the work units. If the project wants to update them, I'd guess it can, but otherwise they remain for a long time and aren't part of what is reset. Wonder what the logic of that is? You'd think BOINC would just wipe out the directory and reattach to the project, thus re-pulling all such control files too. Oh well, live and learn.



©2025 University of Washington
https://www.bakerlab.org