90% failure rate

Message boards : Rosetta@home Science : 90% failure rate

sharder8
Joined: 2 Feb 06
Posts: 7
Credit: 15,648,378
RAC: 0
Message 12345 - Posted: 20 Mar 2006, 19:24:04 UTC

Of the 20 computers that I'm running/have run Rosetta on, only one has had a 90%+ failure rate. That one is a dual Xeon 450 running @ 500MHz. Consequently, that one was moved to another project, as I thought/felt it was/is a machine problem. That box has crunched [FAD], DIMES, and RC5-72 without any problems. In this case, it probably isn't much of a loss to the Rosetta project.

Recently though, another machine started having problems and would end up with an error message containing "daily quota met". The only way I was able to recover was to do a complete un-install, followed by a clean install. Unfortunately, now it continues to get the error regardless of what I do. That machine is a Mobile Sempron 2800+. It's currently running DIMES and RC5 without any problems.

Finally, I've run into the 1% "stuck" problem. This one is getting really tiring, and I've stopped Rosetta on the 2 machines that accounted for by far the majority of the jobs I've had stuck at 1%. I understand that this problem is being worked on, and I will continue crunching Rosetta on the remainder of my machines.

Harder
ID: 12345 · Rating: 0
R/B
Joined: 8 Dec 05
Posts: 195
Credit: 28,095
RAC: 0
Message 12349 - Posted: 20 Mar 2006, 20:52:34 UTC

What is the RC-5 project? I've looked but didn't see anything about it. Thank you.
Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers.


ID: 12349 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 12351 - Posted: 20 Mar 2006, 21:44:29 UTC - in response to Message 12349.  

What is the RC-5 project? I've looked but didn't see anything about it. Thank you.


It's a project trying to crack an encryption algorithm.

RC5

Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 12351 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 13022 - Posted: 4 Apr 2006, 3:24:34 UTC

Well, after coming home from overseas I still see a problem with the program and, looking at other threads, find no explanation. It's been a while since the winter holidays. I have shut down 7 of my machines. I have a machine turning in result after result with no CPU time shown and no points, while the client shows no errors. That will be number 8 I am shutting down on this project.

Rosetta is about to lose not only me forever on this project but my whole team. I have talked to friends on other teams and you guys would not believe the real dislike that's brewing out there for this program. The attitude around here seems to be "so what?" Well, guess what happens when you get people out there calling Rosetta a lousy DC project in the forums?

Explain this to me?
https://boinc.bakerlab.org/rosetta/results.php?hostid=58422
Results for computer

This machine used to do a good job on every DC project on it... It's doing nothing now worth anything.
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 13022 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13025 - Posted: 4 Apr 2006, 4:21:46 UTC

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1323#12948
is a thread where some others are discussing your problem showing up on their Win98 machines. (No time, no credit).

Which means we need more Win98 machines testing out Ralph; and monitored by those that keep track of their machines.

The 90% failure rate that happened prior to you leaving was described elsewhere as a batch of failing WUs.

For this problem.. do you have the option of upgrading to Win2k or WinXP or jumping to Linux? (To help prove that it's an OS issue, not hardware.)

Keep in mind that this client is undergoing the same types of problems that other medical apps had in their early days, and those of us lucky to have come in after the problems were ironed out - never got to see. (This is my first time experiencing the "early stage".) But things are improving. Although it looks like we'll need a 4.84 client update for the Win98 users..

David(s)/Rom, etc: How can we help the programmers track down this problem?
ID: 13025 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 13030 - Posted: 4 Apr 2006, 5:30:23 UTC - in response to Message 13025.  
Last modified: 4 Apr 2006, 5:31:09 UTC

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1323#12948
is a thread where some others are discussing your problem showing up on their Win98 machines. (No time, no credit).

Which means we need more Win98 machines testing out Ralph; and monitored by those that keep track of their machines.

The 90% failure rate that happened prior to you leaving was described elsewhere as a batch of failing WUs.

For this problem.. do you have the option of upgrading to Win2k or WinXP or jumping to Linux? (To help prove that it's an OS issue, not hardware.)

Keep in mind that this client is undergoing the same types of problems that other medical apps had in their early days, and those of us lucky to have come in after the problems were ironed out - never got to see. (This is my first time experiencing the "early stage".) But things are improving. Although it looks like we'll need a 4.84 client update for the Win98 users..

David(s)/Rom, etc: How can we help the programmers track down this problem?

I want you to bear in mind that this machine ran Rosetta perfectly as it is set up now. Then I had the high failure rate with all nine, and no two of those machines are identical. I do not have the option of using a newer Windows OS. It's a matter of what I use the money for: another cruncher, or buying licenses just for machines doing DC projects.

Thank you for your response....
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 13030 · Rating: 0
Whl.
Joined: 29 Dec 05
Posts: 203
Credit: 275,802
RAC: 0
Message 13032 - Posted: 4 Apr 2006, 5:38:54 UTC

I don't have time to attach and report back to Ralph right now, or babysit this thing anymore (too much else happening). My machines were working fine up till 4.83 was released. I will let the existing jobs in the cache run and empty and try back here in a month or so. Hope you sort out all the bugs, guys. Good luck and all the best.
ID: 13032 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13048 - Posted: 4 Apr 2006, 17:27:43 UTC
Last modified: 4 Apr 2006, 17:45:14 UTC

Pphalan wrote:

https://boinc.bakerlab.org/rosetta/results.php?hostid=58422

This machine used to do a good job on every DC project on it... It's doing nothing now worth anything.


Just so we're all on the same page here, from what I understand, on the Win98 PCs in question the SCIENTIFIC computations work fine (from what I can tell by watching the results output of Pphalan's PC), but NO CREDITS are granted, because BOINC reports 0 seconds and claims 0 credits.

Also, AFAIK, everything credit-related (timing, claiming etc) is still done IN BOINC, not in the science application for ALL BOINC projects except SETI-Beta. Apparently the fixes for 4.83 had an effect on BOINC's timing under Win98.
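To illustrate why a reported time of 0 seconds leads to a claim of 0 credits, here is a minimal sketch. It assumes the benchmark-based claimed-credit formula that BOINC clients of that era are generally described as using (100 credits per CPU-day on a host benchmarking 1000 MFLOPS Whetstone and 1000 MIPS Dhrystone). The function name, the constant, and the benchmark values are illustrative assumptions, not taken from this thread.

```python
# Assumed constant: chosen so that 86400 s on a 1000/1000 host claims 100 credits.
SECONDS_TIMES_BENCHMARK_PER_CREDIT = 1_728_000


def claimed_credit(cpu_seconds, whetstone_mflops, dhrystone_mips):
    """Hypothetical helper: credit claimed from CPU time and host benchmarks."""
    benchmark_sum = whetstone_mflops + dhrystone_mips
    return cpu_seconds * benchmark_sum / SECONDS_TIMES_BENCHMARK_PER_CREDIT


# Made-up benchmark values for an older host.
normal_claim = claimed_credit(7200, 900.0, 1500.0)   # 2 hours of CPU time reported
win98_claim = claimed_credit(0, 900.0, 1500.0)       # timing bug: 0 seconds reported

print(f"claim with correct timing: {normal_claim:.2f}")
print(f"claim with 0 s reported:   {win98_claim:.2f}")
```

Because the claim is proportional to the reported CPU time, a result can be scientifically valid and still claim nothing if the client mis-measures the time.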

I guess the project can run a script to correct the credits for WUs which complete correctly, yet due to Win98/BOINC/R interaction time spent is mis-reported.

So the big fuss is (again) about (temporary?) credits. Personally I'd be upset if my PCs spent the time without producing any useful results. I guess everyone is entitled to his priorities.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 13048 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 13051 - Posted: 4 Apr 2006, 18:44:52 UTC - in response to Message 13048.  

Pphalan wrote:

https://boinc.bakerlab.org/rosetta/results.php?hostid=58422

This machine used to do a good job on every DC project on it... It's doing nothing now worth anything.


Just so we're all on the same page here, from what I understand, on the Win98 PCs in question the SCIENTIFIC computations work fine (from what I can tell by watching the results output of Pphalan's PC), but NO CREDITS are granted, because BOINC reports 0 seconds and claims 0 credits.

Also, AFAIK, everything credit-related (timing, claiming etc) is still done IN BOINC, not in the science application for ALL BOINC projects except SETI-Beta. Apparently the fixes for 4.83 had an effect on BOINC's timing under Win98.

I guess the project can run a script to correct the credits for WUs which complete correctly, yet due to Win98/BOINC/R interaction time spent is mis-reported.

So the big fuss is (again) about (temporary?) credits. Personally I'd be upset if my PCs spent the time without producing any useful results. I guess everyone is entitled to his priorities.

I said at the beginning of this thread what my priorities are. I have no measure of whether the machine is doing anything useful, though... For all I know it's turned in nothing. LOL
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 13051 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 13053 - Posted: 4 Apr 2006, 18:46:17 UTC - in response to Message 7476.  

We should be clear of the "bad" work units by now. There still is a 7% chance of getting a bad random number seed but it should in no way be at 90%. Batch 205 is most definitely done by now.

I am on holiday break, but when I and a few others get back, we will fix the seed problem and grant credit to those affected by the recent issues.

I couldn't care less about credit. I want to know that my efforts are doing something for a worthwhile project. I have no interest in projects that are not medical science based. Medical science should have the priority in any project of this type. Life should always be the first consideration. As a Battalion Commander going to Iraq soon, that attitude is foremost on my mind. I want to know something of value is being done. If I find myself in my command throwing resources away on something that is not working, I change what I am doing.

My second post in this thread.
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 13053 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13057 - Posted: 4 Apr 2006, 20:20:34 UTC
Last modified: 4 Apr 2006, 20:23:45 UTC

I have no measure of whether the machine is doing anything useful, though... For all I know it's turned in nothing. LOL


Assuming you're not joking, it's rather easy to tell whether a machine is doing the scientific work or not: you can just click on the resultid URL, e.g.:

https://boinc.bakerlab.org/rosetta/result.php?resultid=15867586

Exit status 0 (0x0)
stderr out
<core_client_version>5.3.1</core_client_version>
<stderr_txt>
# random seed: 1822271
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 11 (nstruct) times
# This process generated 11 decoys from 11 attempts

</stderr_txt>


So you can see that your PC computed 11 predicted protein structures, within the 2hrs (7200sec) it ran on this particular WorkUnit and exited with a status of 0 (success). On WUs/PCs with problems, there are lots of different error codes, which people report in the various specific error-reporting threads in "Number Crunching".
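The check described above can also be done mechanically. The following is a minimal sketch, assuming the stderr text always uses the line formats shown in this one sample; the function names and the "at least one decoy" rule of thumb are hypothetical, not part of BOINC or Rosetta.

```python
import re

# Sample stderr text copied from the result page quoted above.
SAMPLE_STDERR = """\
# random seed: 1822271
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 11 (nstruct) times
# This process generated 11 decoys from 11 attempts
"""


def summarize_stderr(text):
    """Extract the seed, run-time preference, and decoy counts (None if absent)."""
    patterns = {
        "seed": r"# random seed: (\d+)",
        "cpu_run_time_pref": r"# cpu_run_time_pref: (\d+)",
        "decoys": r"generated (\d+) decoys from (\d+) attempts",
    }
    summary = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match is None:
            summary[name] = None
        elif len(match.groups()) == 1:
            summary[name] = int(match.group(1))
        else:
            # (decoys generated, attempts made)
            summary[name] = tuple(int(g) for g in match.groups())
    return summary


def did_useful_work(text):
    """Hypothetical rule of thumb: the run produced at least one decoy."""
    decoys = summarize_stderr(text)["decoys"]
    return decoys is not None and decoys[0] > 0


print(summarize_stderr(SAMPLE_STDERR))
print("useful work:", did_useful_work(SAMPLE_STDERR))
```

A result whose stderr lacks the decoy line, or reports zero decoys, would be the one worth posting in the error-reporting threads.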

This particular issue is a glitch with how BOINC can track process time under Win98 and I've seen it discussed in various other BOINC projects.

My 2 cents...

Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 13057 · Rating: 0
Andrew
Joined: 19 Sep 05
Posts: 162
Credit: 105,512
RAC: 0
Message 13063 - Posted: 5 Apr 2006, 2:27:53 UTC - in response to Message 13057.  
Last modified: 5 Apr 2006, 2:29:14 UTC

This particular issue is a glitch with how BOINC can track process time under Win98 and I've seen it discussed in various other BOINC projects.


This is a known issue with BOINC, not Rosetta. It is one reason why the officially supported Windows platforms are only XP, 2000, and Server 2003.
https://boinc.bakerlab.org/rosetta/rah_requirements.php

Some people don't have any issue running win98, others do... you unfortunately are one of the unlucky ones.
ID: 13063 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13065 - Posted: 5 Apr 2006, 5:11:02 UTC

Does the error show up in Win98SE, or just Win98? (Or the reverse?)


ID: 13065 · Rating: 0
Johnathon
Joined: 5 Nov 05
Posts: 120
Credit: 138,226
RAC: 0
Message 13070 - Posted: 5 Apr 2006, 6:57:11 UTC

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1177#13069
ID: 13070 · Rating: 0
Whl.
Joined: 29 Dec 05
Posts: 203
Credit: 275,802
RAC: 0
Message 13072 - Posted: 5 Apr 2006, 8:04:21 UTC

I see Dr Baker says the science is unaffected with the Win98 problem, so I will continue with those machines.
ID: 13072 · Rating: 0
dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 13252 - Posted: 8 Apr 2006, 17:38:00 UTC - in response to Message 13051.  

I said at the beginning of this thread what my priorities are. I have no measure of whether the machine is doing anything useful, though... For all I know it's turned in nothing. LOL


I suppose that pointing out that error results are extremely useful, doesn't matter to you. Even if a WU errors out it helps to identify which WU's have bugs. As any programmer will tell you, it's impossible to fix a bug you can't find.

I've had my fair share of error WU's and as far as I'm concerned, they were useful. They did not return any scientific results, but they DID help the Rosetta team debug and improve the application.

ID: 13252 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 13318 - Posted: 9 Apr 2006, 14:13:35 UTC - in response to Message 13285.  

I said at the beginning of this thread what my priorities are. I have no measure of whether the machine is doing anything useful, though... For all I know it's turned in nothing. LOL


I suppose that pointing out that error results are extremely useful, doesn't matter to you. Even if a WU errors out it helps to identify which WU's have bugs. As any programmer will tell you, it's impossible to fix a bug you can't find.

I've had my fair share of error WU's and as far as I'm concerned, they were useful. They did not return any scientific results, but they DID help the Rosetta team debug and improve the application.


Your assessment is correct. Workunits that error before completing a model are very useful in finding errors in the software. BUT if they finish at least one model before they fail, they are also useful for the science.

As I understand it now, the problem is with BOINC, not Rosetta. So how is an error with BOINC doing any good for Rosetta? Oh, my primary machine uploaded some more errors for you... it's XP Pro. And all my remotes that keep dropping the program are XP. They have not been added back; it's just too much of a pain.
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 13318 · Rating: 0
River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 13360 - Posted: 9 Apr 2006, 20:44:25 UTC - in response to Message 13341.  


...
Try to remember that unlike most every other BOINC project, Rosetta is trying to find the correct computing approach to the protein problem while at the same time modeling the proteins. In other words they are researching the type of computing required to model proteins. This means that the application itself is part of what is being researched. In practical terms that means that the project feels more like a test environment than say SETI or Einstein. On most BOINC projects the application code required is very clear and stable. That is not the case where the research is focused on determining in part what processing must actually be performed to accomplish the goal.


This is an important point which perhaps still needs to be spelt out clearly to newcomers, and indeed to all who joined before last Xmas: By the standards of Einstein or SETI, Rosetta is a permanent Beta project. It is getting better (due mainly to using Ralph for alpha testing of new WU) and it will continue to get better for a while yet, but it will never be as reliable as Einstein or SETI.

For some users that will turn them away, especially those who seek every last credit. Fair enough: as donors you have the right to donate wherever you feel most happy. Maybe they want every last credit, or maybe they have a lot of boxes at a lot of different places and want the most reliable project going.

For other users, the science is more important than the credits and reliability is important but not absolutely critical. They'd be happier now than they were last winter.

If you want to run Rosetta code that is tested to around SETI's standard of quality, and is being used for production runs on real proteins, then I'd suggest the World Community Grid, and select the option for the Human Proteome Folding project. My son runs that and has not had any problems at all. They are using an older version of Rosetta (version 4.21), maybe not so fast at solving the proteins, but it does seem to have the wrinkles ironed out.

Dr Baker is involved with both projects and has been quoted as saying that both projects are important steps towards solving the problems of protein structures.

For anyone who is still unhappy with the level of reliability here, I'd suggest going here and following the link for people who already run BOINC. I'd also suggest checking back every couple of months as the reliability continues to improve here. I don't suggest checking back if you want the very highest reliability; in that case, stick with the production model over at the grid.

hope that helps
ID: 13360 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 13380 - Posted: 10 Apr 2006, 8:44:33 UTC - in response to Message 13341.  

I said at the beginning of this thread what my priorities are. I have no measure of whether the machine is doing anything useful, though... For all I know it's turned in nothing. LOL


I suppose that pointing out that error results are extremely useful, doesn't matter to you. Even if a WU errors out it helps to identify which WU's have bugs. As any programmer will tell you, it's impossible to fix a bug you can't find.

I've had my fair share of error WU's and as far as I'm concerned, they were useful. They did not return any scientific results, but they DID help the Rosetta team debug and improve the application.


Your assessment is correct. Workunits that error before completing a model are very useful in finding errors in the software. BUT if they finish at least one model before they fail, they are also useful for the science.

As I understand it now, the problem is with BOINC, not Rosetta. So how is an error with BOINC doing any good for Rosetta? Oh, my primary machine uploaded some more errors for you... it's XP Pro. And all my remotes that keep dropping the program are XP. They have not been added back; it's just too much of a pain.


Some of the issues are BOINC related, but that does not mean that the models completed in a work unit that errors out are not useful. Moreover, ANY errors that are identified (BOINC or otherwise) help improve the application. If it is a BOINC issue, in some cases the application can be modified to work around the problem. But ONLY if the errors can be examined. That is why all of the returned results are useful. "Aborted by GUI" results are less useful than the ones that are allowed to crash on their own, but they are all useful.

In some cases the project asks people who are having errors to connect to the Ralph project. In Ralph the application returns more detailed error results which are used to improve the application. Basically the same code, but used to find and kill the bugs.

Try to remember that unlike most every other BOINC project, Rosetta is trying to find the correct computing approach to the protein problem while at the same time modeling the proteins. In other words they are researching the type of computing required to model proteins. This means that the application itself is part of what is being researched. In practical terms that means that the project feels more like a test environment than say SETI or Einstein. On most BOINC projects the application code required is very clear and stable. That is not the case where the research is focused on determining in part what processing must actually be performed to accomplish the goal.

That is why there is no such thing as "wasted" CPU time on Rosetta. Even the errors are valuable to the research. It does result in lost credit from time to time for some users. But that is why the Rosetta team (unlike most BOINC projects) will frequently go back to award credits. They view the errors as being important to the research. In a lot of cases these awards have been to single users for a problem unique to their situation. If you read the boards from the other projects, credit awards after the fact are a very rare thing, and I have never seen credits awarded to individual users for a unique problem. That is not the case here. While there is some delay in the awards due to the time demands placed on the project team, the credit is granted in almost every case where people have asked.

Thank you very much. My background is in Physics and Electrical Engineering, and I want to run just one project. I have too much of a communication systems background to ever run SETI... Those massive radio noise makers called stars and the vast size of the galaxy make it futile. lol
Keep up the good work...I appreciate it.
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 13380 · Rating: 0
River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 13382 - Posted: 10 Apr 2006, 9:20:31 UTC - in response to Message 13380.  
Last modified: 10 Apr 2006, 9:23:31 UTC

I have too much of a communication systems background to ever run SETI... Those massive radio noise makers called stars and the vast size of the galaxy make it futile. lol



On that at least we agree! You will see from my stats the relative importance I've given SETI ;-)

Keep up the good work...I appreciate it.


Me too - the quality of the feedback and responsiveness to users is what keeps me donating time here, even tho physics is my favoured field.
ID: 13382 · Rating: 0



©2025 University of Washington
https://www.bakerlab.org