Message boards : Rosetta@home Science : 90% failure rate
Author | Message |
---|---|
Pphalan Joined: 5 Nov 05 Posts: 53 Credit: 291,580 RAC: 0 |
I am showing a 90% PLUS failure rate on all jobs done in the last three days on 9 machines. I am headed overseas and have no time for a program that's such high maintenance. My remotes are always shutting down the project. It is simply not worth doing when it becomes nothing but extra wear and tear on my machines running a CPU at 100% non-stop for NOTHING. Never mind the increase in the electric bill. Rosetta is one big disappointment and does not deserve my effort. http://www.christianboards.org/forum.php http://usalug.org/phpBB2/index.php |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 0 |
"It is simply not worth doing when it becomes nothing but extra wear and tear on my machines running a CPU at 100% non-stop for NOTHING." From the project perspective: Continuing to run accelerates the removal of the "short WUs" from the system. Some of these appear (to me!) to even be returning valid data in the first few passes before they hit the random-number-glitch that causes them to error out. From the participant's perspective: If a "short" WU has taken some amount of CPU time without giving credit, unlike the majority, which fail in just a few seconds, there is the _possibility_ that these will receive credit, after the staff returns from the holidays. The staff has _already_ said that they would grant credit for any of the "DEFAULT_xxxxx_205" results, whether aborted by the participant, or allowed to run until they hit the "maximum CPU" limit. The "short WUs", if you are on a high-speed continuous connection, are really not much of a problem. They come, they run for a few seconds, they error out. It is possible that even _those_ may be granted credit, as the project staff feels pretty bad about having them slip out the door. If you are on dial-up, or pay for your bandwidth, then these _ARE_ a problem, and the simple solution is to suspend Rosetta until they are gone. Maybe it's just me... but if I were _only_ interested in the credit, I would have my PC running CPDN non-stop right now, as that's the project that grants the highest "credit per hour". Instead, it's running its standard share of Rosetta, CPDN, and Einstein, because those are the three projects I am most interested in at the moment. I understand the "competitive urge", but I guess it's just not _that_ big a deal to me. The project is worthwhile, I'm already donating both my CPU time _and_ my personal time, so the fact that I'm unlikely to earn very many credits in the next few days for that time... who cares? |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
We should be clear of the "bad" work units by now. There still is a 7% chance of getting a bad random number seed but it should in no way be at 90%. Batch 205 is most definitely done by now. I am on holiday break, but when I and a few others get back, we will fix the seed problem and grant credit to those affected by the recent issues. |
Pphalan Joined: 5 Nov 05 Posts: 53 Credit: 291,580 RAC: 0 |
"We should be clear of the 'bad' work units by now. There still is a 7% chance of getting a bad random number seed but it should in no way be at 90%. Batch 205 is most definitely done by now." I could care less about credit. I want to know that my efforts are doing something for a worthwhile project. I have no interest in projects that are not medical science based. Medical science should have the priority in any project of this type. Life should always be the first consideration. As a Battalion Commander going to Iraq soon that attitude is foremost on my mind. I want to know something of value is being done. If I find myself in my command throwing resources away on something that is not working I change what I am doing. http://www.christianboards.org/forum.php http://usalug.org/phpBB2/index.php |
Polian Joined: 21 Sep 05 Posts: 152 Credit: 10,141,266 RAC: 0 |
"I want to know that my efforts are doing something for a worthwhile project." Personally... I would say that Rosetta is, easily, the best maintained, smoothest running, most worthwhile distributed computing project currently available. Read the boards some more and the entire site and compare to the other projects and I believe you will soon agree. |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
"I have no interest in projects that are not medical science based. Medical science should have the priority in any project of this type." Boy are you in luck ... You have 3 projects with many more in testing. So Rosetta@Home, Predictor@Home, and WCG are all available for you to run (I have run, or am running all three). "I want to know something of value is being done. If I find myself in my command throwing resources away on something that is not working I change what I am doing." Paul is a little contrarian with respect to the way many feel about problems like what we see here at Rosetta@Home. I cannot, in my wildest imagination, understand why anyone would think that the project is not more upset by the loss of work effort than we could ever be ... This is their ticket to fame as it were. With that in mind, the staff has told us a number of times that even the failures are interesting. Same thing at CPDN. And, though some of the errors were from malformed work units, this has happened on other projects (SETI@Home created 40-60K zero length work units the last outage) and will happen again. You do have a little bit of a unique situation with many remotes. In this case perhaps Rosetta@Home is not for you at this time. Though I will point out that Predictor@Home has issued work that will pop up a FORTRAN error dialog that stops the computer from running any other BOINC process until "Ok" is pressed (I lost several days processing time over that one). Again, no project is without flaw or will always issue computable work. Rosetta@Home just hit a bad spot, and we hope it is cleared up ... |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
Absolutely agree with this. The reason the project rushed out some new style work units just before the hols was to get some new interesting science crunched while they were away. Worthy motive. As a rush job, they overlooked some vital points. They have apologised. Meanwhile they are all off on vacation (or at least home with their families). This means that it will take longer to fix than if it happened at another time. Except at another time it would not have been a rush job in the first place. When you are out in Iraq, as a commander there will be times when you have some kind of target of opportunity and you will need to make a snap judgment: go for it unprepared or hang back till you are fully ready but by then the target may have gone. Sometimes you will get it right. Sometimes you will get it wrong and then you will have that awful feeling after the troops have committed to the action that it is all going wrong and it is too late to recall. People who explain to you where things went wrong will be doing you a favour. People who go on and on about it once you have understood and admitted the error will not be helping at all. To turn the analogy back to here, I am suggesting to you that you have crossed the line into the latter kind of unhelpful criticism, though clearly you mean to be helpful. Please, you clearly are not happy crunching for Rosetta in the current situation. Please, suspend work fetch till (say) mid Jan and then check back. I am nothing to this project but a new member, so I am not speaking on behalf of Rosetta but just from myself. We understand that it was a mistake, we understand that you and many others are upset about it. You are yearning to do something useful and it is all going pear shaped and it is not your fault, it is the project's mistake. The project has shown it understands that. The project plans to do better when they return after the break. 
They, like you, care about the medical science and with or without your further criticism they will already be regretting the lost science. Whether we go elsewhere till they have a chance to get things sorted, or whether we stay here, please let's move on. We have had the repentance, it is time for the forgiveness -- especially some will feel at this time of year. Please? River~~ |
Pphalan Joined: 5 Nov 05 Posts: 53 Credit: 291,580 RAC: 0 |
River, do this project a favour and stop talking to me. Now you are going to lecture a multiple combat vet on operations? It is more than one incident with this project; I will do a wait and see until after the holidays. I can't believe some people. I have one guy telling me things do not always go right. 3 days ago we buried a man killed in a training accident and this guy thinks I need a lecture on how things do not always go right. Reminded myself that I am not talking to people who run the project. http://www.christianboards.org/forum.php http://usalug.org/phpBB2/index.php |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"River do this project a favour and stop talking to me. Now you are going to lecture a multiple combat vet on operations?" Sorry. You are right, I know nothing about being a soldier. But it seems fair to me. You have been lecturing computer professionals in dozens of postings about how their job should be done, and what their priorities should be. One little lecture back from me and you can't hack it. I will do a deal: if you stop telling the programmers what their priorities should be, I will stop pretending I know anything about your job. Fair? |
Pphalan Joined: 5 Nov 05 Posts: 53 Credit: 291,580 RAC: 0 |
I am going to ignore your utter arrogance because at no time did I ever tell anyone how to do their job. I want to know why I am contributing less than 10% to this project even now. The machines I left on have done nothing but client errors. I mean right now a success is rare. Bad batch my posterior portion of my body...... One of my Captains called me up to tell me just before midnight he became a Daddy to a girl...3 minutes before midnight....WE....my team has a new member in our family. SHUT UP RIVER....Every time you say something to me it's an arrogant comment I would not say to a child let alone to an adult. http://www.christianboards.org/forum.php http://usalug.org/phpBB2/index.php |
rbpeake Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
Merry Christmas, and peace and goodwill to all! :) Regards, Bob P. |
Deamiter Joined: 9 Nov 05 Posts: 26 Credit: 3,793,650 RAC: 0 |
As a bit of a sanity check, do you have each of your BOINC projects set to "leave in memory" when they switch projects? It's a known issue with the Rosetta alpha project that some machines will error out without this changed. Since you're obviously running multiple projects, you'd have to set it on each project or when you updated, it'd keep switching the setting back and forth. It's not an ideal situation (they don't claim to be gold or anything) and they're working on the problem. It doesn't affect EVERY computer configuration (as I've never seen the problem on any of my four Dell computers) but it sounds like that could be your problem. It's something to try if you're still erroring out all your WUs. I haven't seen any problems in the last couple days, so obviously I haven't gotten any of the longer problem WUs... If that doesn't fix it, or if it's too much for you, I'd say don't hesitate to drop the project altogether. We'll certainly miss your cycles, but running pre-release alpha and beta tests isn't for everybody. I hope you won't hold it against Rosetta in general, and if you do end up dropping Rosetta for now, it'd be nice to see you back when a totally stable version is released. |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"at no time did I ever tell anyone how to do their job" It seemed to me at the time that you were. Looking back on your posts now I can't see how I thought that. Maybe partly frustration from my own boxes' problems - several of my remote boxes crunched nothing useful over Christmas - but others were working fine and it wasn't clear why. Anyway, however it happened I'm sorry. Have a good 2006 River~~ |
Pphalan Joined: 5 Nov 05 Posts: 53 Credit: 291,580 RAC: 0 |
"at no time did I ever tell anyone how to do their job" "it seemed to me at the time that you were." No problems, I am a little on edge here. Most of my job failures are not the 205 series and I have one machine in its second day on one job and it hasn't budged from 10%. Guess I should abort it but don't know. http://www.christianboards.org/forum.php http://usalug.org/phpBB2/index.php |
Deamiter Joined: 9 Nov 05 Posts: 26 Credit: 3,793,650 RAC: 0 |
I don't mean to sound repetitive -- especially if you DIDN'T miss my last post. Have you made sure to set each project on the machine to "leave in memory when suspended?" I haven't seen anybody post that this has been fixed yet -- it seems to be a long-term problem that affects some computers. It's something that'll have to be fixed before the project goes gold, but for now, to run the project you'll just have to leave it in memory. |
lats Joined: 12 Feb 06 Posts: 1 Credit: 1,673,666 RAC: 0 |
This is all well and good but does it survive a reboot? A number of failures appear after rebooting. Is someone fixing the problem? |
UsedBits Joined: 18 Feb 06 Posts: 1 Credit: 650 RAC: 0 |
The four systems I have running Rosetta produce mostly errors. They will crunch hours, maybe days, then throw an error. I'm going to start the process of removing Rosetta and running something else. This is in the hope that more work and fewer errors are produced. Besides, Rosetta was chosen in error (through ignorance) - in the hopes that my contribution might benefit Parkinson's disease. I had embarked on loading Seti@Home after a long absence from them and discovered Rosetta and the others. It just seemed more worthwhile to contribute to medicine than little green men. However, given that my systems contribute nothing to Rosetta due to the near 100% failure rate, it loses nothing by my absence. Regards, UsedBits |
Paul Smith Joined: 2 Dec 05 Posts: 1 Credit: 7,954 RAC: 0 |
Hi, I'm having very similar problems. Hours of computer time then "unrecoverable error for result ......". What sticks in my craw though is that you receive no credit for all the time spent. The fault must lie with Rosetta as every time I look at the results of other participants crunching the same unit, same problem. Rosetta seems to just shrug and say it's not our problem, it's your computer. Well I hope that attitude will change soon or I'm taking my computer time elsewhere. |
Carlos_Pfitzner Joined: 22 Dec 05 Posts: 71 Credit: 138,867 RAC: 0 |
Seems to me that the error rate drops when I run only one project per PC: no app suspend/resume, less swap usage, etc. To get this, click on "no more work" for all projects except the one you choose to run on this PC. After some time, you will be running only 1 project, with the benefit of increased stability from less swap and no suspend/resume. If you have two PCs, you can either run the same project on both PCs, or run two projects, one project on each PC. btw: There is a new "Life sciences project" that I believe is worth trying: http://qah.uni-muenster.de/scientific.php http://qah.uni-muenster.de/create_account_form.php?teamid=37 Visit our team forum: http://www.fadbeens.co.uk/phpBB2/ |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"... Rosetta seems to just shrug and say it's not our problem, it's your computer. Well I hope that attitude will change soon or I'm taking my computer time elsewhere." Totally agree. I'd like to add some perspective on this. When this thread was started around last Xmas, Rosetta was going through a very bad spot. While the programmers involved with Rosetta were experienced in building and running software for mainframes and/or small grids of computers, I think it is fair to say that none of the team had experience in the complexities that come in when running on a grid of tens of thousands of differing machines. It is also fair to say that the project was over-ambitious in the work it tried to put into the system around last Xmas. In the three months between then and now, the team seem to have worked hard on addressing these issues. The Ralph project has been created from scratch as a first-filter to catch the more serious problems before they reach the production level project. That project is beginning to deliver, but not all the outstanding issues from last Xmas have been caught yet, if I am not mistaken. It is still true that some machines are more vulnerable than others to the outstanding problems. In the short term the only advice that can be given to someone with such a machine is that it seems to be a machine specific issue - that does not mean that the solution is not being looked for, it means that in all fairness it is one of many outstanding issues and the project team do not advise you to try to hold your breath while they find that particular issue. The project has always said it aims to be the best BOINC project. In terms of feedback on the scientific meaning of the work, I believe Rosetta has almost always met that target. In terms of delivering a robust app across diverse platforms they have not done so well. There are reasons for that. One reason last year was being over-ambitious - they've realised that and backtracked a bit, and quite rightly so. 
There is another reason that will never go away. Rosetta aims to develop and test a multiplicity of different approaches to the same protein folding problem. Rosetta therefore has one more degree of diversity than all the other projects - it is running diverse apps where they are running single apps, and to compare that diversity is part of the point. Even if the Rosetta team were as experienced as the SETI crew, we'd therefore not expect the Rosetta code to cope with as many corner cases as the SETI code does, for example. The team are on a learning curve. The first and most important thing is that they were willing to learn from users from day one, and they have done so, and (it seems to me) are continuing to do so. As m9 says, it takes time. If it seems to take too long for your needs, then you are right: maybe this is not the project that best suits you. I think the folks at Rosetta would respect that choice, even though they'd like the benefit of your cpu power. But please, if you do leave, don't think that the team here are ignoring the users, they are working hard and trying to balance priorities. There was a poster on the wall one place I worked, about how when you are up to your neck in alligators it is easy to forget that you came to drain the swamp. Well Rosetta found a fair few alligators last Xmas. The point for me is not that some of them are still there, but that some of them have gone, and that some drainage has been going on as well. That, for me, is far more important than quibbling over whether they have got the exact balance right. River |
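As a rough check on the numbers quoted in this thread: David E K put the chance of a bad random number seed at about 7%, while the opening post reported failures on more than 90% of jobs. The sketch below assumes, purely for illustration, that work units fail independently at that 7% rate (the thread does not say this), and uses a hypothetical helper named `prob_at_least`. It suggests that bad seeds alone would be very unlikely to produce a 90% failure rate, which fits the later posts pointing at other causes such as the bad batch or the "leave in memory" setting.

```python
from math import comb


def prob_at_least(n: int, k: int, p: float) -> float:
    """Chance of at least k failures among n independent jobs,
    each failing with probability p (binomial upper tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figure quoted by David E K in the thread.
BAD_SEED_RATE = 0.07

# Hypothetical sample size chosen for illustration: 10 jobs,
# of which 9 or more fail (roughly a "90% failure rate").
jobs = 10
expected_failures = jobs * BAD_SEED_RATE
chance = prob_at_least(jobs, 9, BAD_SEED_RATE)

print(f"Expected failures in {jobs} jobs at 7%: {expected_failures:.1f}")
print(f"Chance of 9 or more failures from bad seeds alone: {chance:.2e}")
```

Under these assumptions fewer than one job in ten would be expected to fail, so a machine seeing nine failures in ten is almost certainly hitting something other than the seed problem.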
©2025 University of Washington
https://www.bakerlab.org