Message boards : Number crunching : Problems with version 5.90/5.91
Author | Message |
---|---|
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
if it's so common, why wasn't the linux problem picked up on RALPH??? It appears the work actually completes normally; only the progress indicator looks wrong along the way. So you would actually have to watch it run to see any problem. Rosetta Moderator: Mod.Sense |
Marky-UK Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0 |
if it's so common, why wasn't the linux problem picked up on RALPH??? Work might complete eventually, but it definitely doesn't complete normally. Every WU I've watched has gone past my runtime limit by hours. I suspect the only way the WUs will complete on their own is when Rosetta's internal time limit kicks in (6x the runtime limit, isn't it?). And that's assuming the built-in limit is even working. |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
All linux users -- thanks for posting! It's quite interesting: in the past, we've seen issues that were Windows-specific, then Mac-specific, but typically linux has been robust (especially since the app doesn't have graphics). We're looking into the current Rosetta@home/linux issue (I think the cpu time call must be messed up in the latest boinc api), but it may take a few days to track it down. In the meantime, please feel free to switch to another app. Apologies... there aren't that many linux users on RALPH -- if you're interested in helping out, we'd be grateful if some more linux clients attached to ralph at least part time. |
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 0 |
All linux users -- thanks for posting! It's quite interesting: in the past, we've seen issues that were Windows-specific, then Mac-specific, but typically linux has been robust (especially since the app doesn't have graphics). Just to reply to my previous post today.... The WUs are finishing with beta 5.90, but this one went 24 minutes over my preference time (2 hours). The status takes almost 10 minutes to update in BOINCMGR, but the WU did finish OK. Reporting results now; here is the result that finished: 128257568 |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Update: we've tracked down the problem -- it's an issue with the BOINC-provided API (I guess we happened to be unlucky in being the first to update our linux app after the bug got introduced). Later today, we'll update the ralph and rosetta@home linux apps and they should work. |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
OK, just did the update -- this should revert the "cpu run time" and "% complete" behavior to what linux clients are used to! Please let me know if this fixes the issue (looks good locally). Also, there were complaints about memory usage for Rosetta 5.89 -- have these problems become better? Thanks for the continuing feedback! |
Marky-UK Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0 |
OK, just did the update -- this should revert the "cpu run time" and "% complete" behavior to what linux clients are used to! Please let me know if this fixes this issue (looks good locally). Thanks Rhiju! I haven't had any WUs that have run to completion yet, but at least the CPU time is incrementing :-) I'll check on my clients in the morning. |
Angus Joined: 17 Sep 05 Posts: 412 Credit: 321,053 RAC: 0 |
If 5.90 had been tested on Ralph, it never would have made it here in broken form. Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :) "You can't fix stupid" (Ron White) |
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 0 |
If 5.90 had been tested on Ralph, it never would have made it here in broken form. You could criticize or you could help. Try attaching to the Ralph project with a linux machine. Rhiju said they needed more linux testers. I'm just thankful they responded to the issues in this thread quickly, so not much science will be lost. |
j2satx Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
If 5.90 had been tested on Ralph, it never would have made it here in broken form. I have 10 Linux cores on Ralph with only 3 WUs. Server status is "zero" queued. I guess you could say I criticize and "try" to help. There is still no coherent explanation why something is tested on Ralph for only one day before it gets implemented on Rosetta. |
Thomas Leibold Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
Update: we've tracked down the problem -- it's an issue with the BOINC-provided API (I guess we happened to be unlucky in being the first to update our linux app after the bug got introduced). Later today, we'll update the ralph and rosetta@home linux apps and they should work. Since you tracked down the problem, can you please tell us how it will affect all of us running Rosetta on Linux? We already know that those 5.90 tasks will not finish after the specified runtime. Without manual intervention, will these tasks ever end on their own, or do I have to go to each and every server and manually abort all the 5.90 tasks? I have over 100 cpus running Rosetta on Linux and having to clean up this mess is not something I'm looking forward to. It especially upsets me that the lack of testing on Ralph caused the problem to appear in Rosetta. This was clearly avoidable! Team Helix |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Hi: Here's a partial explanation. On ralph, nearly all the workunits for a full day had returned and come back as "successes", typically a very good sign -- but the linux issue, as you correctly pointed out, leads to delayed responses from clients (rather than a bunch of immediate WU errors that tell us to go track down the problem). Since there are very few RALPH linux users, we didn't notice a drop in the overall return rate of successes. The only sign that things were wrong was a message board posting there (later bolstered by your and others' posts) and here ... So, thanks for posting -- it did help us catch the problem relatively quickly -- and please accept our apologies. We'll certainly pay closer attention to this in the future, and do tests for, say, at least two days. If you could recruit some more Rosetta@home linux users to give a fraction of their CPUs to ralph and occasionally post errors in the message boards, that would also help! |
BitSpit Joined: 5 Nov 05 Posts: 33 Credit: 4,147,344 RAC: 0 |
I suspended some jobs to force a couple of 5.91 jobs to run. I'm happy to report they ran without problems. 5.91 seems good. |
Stephen Glick Joined: 10 Dec 05 Posts: 3 Credit: 2,534,655 RAC: 18 |
Hi. I have been running Rosetta and SETI for a long time. Recently Rosetta is taking over my computer with some sort of animated screen saver that won't quit. When I quit the screen saver, it just starts right up again. I don't want it, but can't seem to get rid of it. I've even tried deleting it, but it just recreates itself and comes right back. How can I get rid of this thing? If I can't, I'm going to quit doing Rosetta and allocate all my processing time to SETI. My computer is a Mac G-5 2.3 dual processor. Thanks. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Stephen is on a Mac, so I'm not sure on the details. On Windows, you have to specifically set your screensaver to "BOINC" to see it, but I think that is the default as you install. And when it is working on SETI, you would set the SETI screensaver as well. So, check what you have set your screensaver to. Rosetta Moderator: Mod.Sense |
tng* Joined: 28 Oct 05 Posts: 14 Credit: 5,389,798 RAC: 0 |
3 boxes running CentOS now -- plan to convert more (all except the laptops, and maybe them too). How many do you need on Ralph? |
Paul Joined: 29 Oct 05 Posts: 193 Credit: 66,440,203 RAC: 9,762 |
3 boxes running CentOS now -- plan to convert more (all except the laptops, and maybe them too). How many do you need on Ralph? Windows XP - Intel Q6600 with 2MB RAM I recently aborted this WU because it was using 0% CPU. I suspended the process and resumed it twice with no change. It was locked at exactly 43:00 cpu time. This is the second or third WU I have been forced to abort in the last few days. 13 WUs completed prior to this issue so I think my hardware is OK. https://boinc.bakerlab.org/rosetta/result.php?resultid=128495612 Thx! Paul |
Thomas Leibold Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
I have posted the steps I'm taking to recover from the 5.90 problem on my Linux systems in the Ralph forum. Perhaps this is useful to other Linux users. Team Helix |
BitSpit Joined: 5 Nov 05 Posts: 33 Credit: 4,147,344 RAC: 0 |
There seems to be a problem with the 1zpy__BOINC_TWIST_RINGS_SYMM_FOLD_AND_DOCK-1zpy_-native__2470 jobs. So far, the watchdog has killed 6 out of 7 jobs. https://boinc.bakerlab.org/rosetta/result.php?resultid=128342607 https://boinc.bakerlab.org/rosetta/result.php?resultid=128342579 https://boinc.bakerlab.org/rosetta/result.php?resultid=128342572 https://boinc.bakerlab.org/rosetta/result.php?resultid=128344197 https://boinc.bakerlab.org/rosetta/result.php?resultid=128342732 https://boinc.bakerlab.org/rosetta/result.php?resultid=128342735 |
sslickerson Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0 |
This one just errored out: 128154725 This is on a windows XP box. Rosetta asked Zone Alarm for access to the net. I gave permission and it killed itself. Tim |
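
Rhiju's explanation earlier in the thread (a hang shows up as delayed returns rather than errors, and few RALPH hosts run Linux) can be illustrated with a small sketch. Everything here is hypothetical: the task counts, the platform labels, the 0.5 threshold and the function names are invented for illustration and are not taken from the Rosetta@home or BOINC server code. The point is only that an overall return rate barely moves when a small platform stalls, while a per-platform breakdown makes the stall obvious.

```python
from collections import defaultdict


def overall_rate(tasks):
    """Fraction of all issued tasks that have reported back."""
    return sum(1 for _, returned in tasks if returned) / len(tasks)


def return_rates(tasks):
    """Map each platform to the fraction of its tasks that have reported back."""
    issued = defaultdict(int)
    returned = defaultdict(int)
    for platform, has_returned in tasks:
        issued[platform] += 1
        if has_returned:
            returned[platform] += 1
    return {p: returned[p] / issued[p] for p in issued}


def stalled_platforms(tasks, threshold=0.5):
    """Platforms whose return rate is below the (assumed) threshold."""
    rates = return_rates(tasks)
    return sorted(p for p, r in rates.items() if r < threshold)


# Hypothetical test batch: 1000 tasks, only 20 of them on Linux hosts.
tasks = (
    [("windows", True)] * 855 + [("windows", False)] * 45  # 95% returned
    + [("mac", True)] * 76 + [("mac", False)] * 4          # 95% returned
    + [("linux", True)] * 2 + [("linux", False)] * 18      # 10% returned
)

print(f"overall: {overall_rate(tasks):.3f}")
for platform, rate in sorted(return_rates(tasks).items()):
    print(f"{platform}: {rate:.2f}")
print("stalled:", stalled_platforms(tasks))
```

With these made-up numbers the overall rate is 93.3%, less than two points below the 95% a healthy batch would show, yet the Linux rate alone is 10%. A per-platform check of this kind is one possible way to catch such a problem without relying on message board reports.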
©2024 University of Washington
https://www.bakerlab.org