Why can't Rosetta checkpoint more often (compared to WCG)? +feedback

Nemesis
Joined: 12 Mar 06
Posts: 149
Credit: 21,395
RAC: 0
Message 38184 - Posted: 23 Mar 2007, 21:31:34 UTC - in response to Message 38182.  

On Rosetta, my first credit was invalidated after many hours. On WCG, a task showed "finished" but the website had not updated, so I quit distributed computing for a couple of days. Then I came back to WCG to see if it had updated.. and it had! Whoo, after many hours, it hadn't reset the timer or the checkpoint (at least not significantly)...

I'm wondering. Can anyone explain how Rosetta's processing differs from WCG's FightAIDS@Home and Genome Comparison (the only two I've run)? Why can those two checkpoint at good intervals, while Rosetta sits at 1% for hours? If I exit and start it again, the timer resets to 0, and I have no idea whether the "actual" progress is 1% or 4 hours' worth.

For example, Rosetta processing is like a house of cards in the wind: it always needs your "shielding". Or, on a PC, when your "shield" (the RAM) goes away, the house of cards goes away. Would that be a fair picture of why Rosetta is quirky?
===
"Q: Progress Percent not advancing?
A: Rosetta recomputes the progress percent at the end of each model."
Ok... so why does WCG seem to know "how much" is total/needed/done? Can someone explain the differences in the workloads?

Q: "To completion" time is going UP!
Answer is not normal.. Come on, 1 second increments? How about recalculating it every ~10 minutes or so, so that you don't get the randomness of download managers but still have an estimate that makes sense?


You're singing my song!

Maybe this will become my personal crusade - to get the 1% and Completion Time problems fixed.

Right now, there has been no acknowledgement that the Rosetta programmers are working on it, or that they intend to work on it.

BTW, there is an entire thread devoted to this topic.

Nemesis n. A righteous infliction of retribution manifested by an appropriate agent.


Nemesis
Joined: 12 Mar 06
Posts: 149
Credit: 21,395
RAC: 0
Message 38186 - Posted: 23 Mar 2007, 22:24:31 UTC - in response to Message 38185.  

I realize that..
I don't think my question has been answered: why doesn't Rosetta checkpoint more often, and why does it reset the timer (and maybe everything else)? Someone said the timer reset is because of "dumping memory", but WCG is also set to "dump memory" and it doesn't reset the time.

Because Rosetta doesn't checkpoint until the end of a model, a stopped task has to start over from the last completed model, or from the very beginning if it was still in its first model. In that first-model case the clock starts over as well.

I've never run WCG, but it sounds like it does a checkpoint and saves the crunching time info when you stop it. That's totally up to the science app programmers and how they decide to do it.
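To make the behaviour described above concrete, here is a minimal Python sketch of a task that only saves state at model boundaries. It is a toy under stated assumptions, not Rosetta's actual code: `run_task`, the JSON state file, and the `stop_after_steps` switch used to simulate the user exiting are all hypothetical names made up for illustration.

```python
import json
import os
import tempfile


def run_task(state_path, total_models, steps_per_model, stop_after_steps=None):
    """Run a toy task that checkpoints only when a whole model is finished.

    Returns the number of models completed when the function returns.
    """
    # Resume from the last checkpoint if one exists.
    done = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)["models_done"]

    steps_run = 0
    for model in range(done, total_models):
        for _ in range(steps_per_model):
            if stop_after_steps is not None and steps_run >= stop_after_steps:
                # Interrupted mid-model (hypothetical stand-in for the user
                # exiting): the partial work on this model is simply lost.
                return done
            steps_run += 1
        done = model + 1
        # Checkpoint only at the model boundary.
        with open(state_path, "w") as f:
            json.dump({"models_done": done}, f)
    return done


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "state.json")
    print(run_task(path, 3, 10, stop_after_steps=5))   # stopped inside the first model
    print(run_task(path, 3, 10, stop_after_steps=15))  # stopped inside the second model
    print(run_task(path, 3, 10))                       # resumes and finishes
```

In this sketch, stopping inside the first model leaves no state file at all, so the next start begins from zero. Stopping inside a later model only loses the work done since the last completed model.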

Nemesis n. A righteous infliction of retribution manifested by an appropriate agent.


Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 38196 - Posted: 23 Mar 2007, 23:25:54 UTC - in response to Message 38187.  

So I'd like to hear from a Rosetta dev why they can't resume work (save often) in the middle of a crunching...


See Bin Qian's comments from when checkpointing was originally added to Rosetta almost a year ago. As mentioned in that thread, the new version of BOINC also has new features that try to preempt one project to begin another only at a checkpoint.

Rosetta Moderator: Mod.Sense
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 38234 - Posted: 24 Mar 2007, 15:06:09 UTC

Well, Rosetta doesn't count in 1 second increments. It's not as if Rosetta made a calculation every second, showing the time remaining increasing. It's just BOINC showing you *its* guess based on the increasing CPU time. So, 1 second of time passing increases the CPU time used by ~1 second, and BOINC takes the total CPU time used so far, along with the % complete and its history of how long it took you to complete tasks in the past, and shows you the result as estimated time to completion. You can see this better if you let a task run longer. Later in the run, when % completed is over 50%, the runtime still increases one second at a time, but the estimated time to completion doesn't change every second.
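As a rough illustration of how an estimate built from those three inputs can behave, here is a small Python sketch. It is NOT BOINC's actual formula: `estimate_remaining`, its parameters, and the blending rule are assumptions made up for this example.

```python
def estimate_remaining(cpu_seconds, fraction_done, typical_total_seconds):
    """Blend a history-based estimate with a progress-based one.

    Hypothetical blending rule for illustration only, not BOINC's formula.
    """
    # History-based guess: what past tasks took, minus the time already used.
    from_history = max(typical_total_seconds - cpu_seconds, 0.0)
    if fraction_done <= 0.0:
        return from_history
    # Progress-based guess: extrapolate the current rate to 100%.
    from_progress = cpu_seconds * (1.0 - fraction_done) / fraction_done
    # Trust the progress-based guess more as the task nears completion.
    return fraction_done * from_progress + (1.0 - fraction_done) * from_history


if __name__ == "__main__":
    # Progress stuck at 1%, typical runtime of one hour.
    for cpu in (600, 601, 7200, 7201):
        print(cpu, estimate_remaining(cpu, 0.01, 3600))
```

With a typical runtime of 3600 s and progress stuck at 1%, this sketch returns about 3564 s at both 600 s and 601 s of CPU time, so the countdown stalls. Once the CPU time passes 3600 s, the estimate climbs by roughly 0.99 s for every second of CPU time, which resembles the "going up one second at a time" symptom reported in this thread.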

I mention this simply to point out that the numbers you are observing are a level removed from the numbers Rosetta's programs are working with. So it further complicates reaching the goal of a smoothly declining timeline.

I don't know all the details about how the numbers get revised and how they are communicated back to the BOINC Manager. Nor am I the one who can improve how it works. I'm just trying to explain the parts that I can. The need for, and benefits of, improvement are pretty clear. So, I'm confident we will see some improvements in future releases.
Rosetta Moderator: Mod.Sense
Angus
Joined: 17 Sep 05
Posts: 412
Credit: 321,053
RAC: 0
Message 38238 - Posted: 24 Mar 2007, 15:42:48 UTC - in response to Message 38234.  

I don't know all the details about how the numbers get revised and how they are communicated back to the BOINC Manager. Nor am I the one who can improve how it works. I'm just trying to explain the parts that I can. The need for, and benefits of, improvement are pretty clear. So, I'm confident we will see some improvements in future releases.


And still, the developers and real project people are strangely silent on this.

No comments or acknowledgements in the 1% thread.

Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :)



"You can't fix stupid" (Ron White)
Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 38254 - Posted: 24 Mar 2007, 20:34:36 UTC - in response to Message 38246.  

Hi John, Mod.Sense, and others:

Thanks for bringing this up. More checkpointing and better time-to-completion feedback were big causes of controversy last year (and again now!). We did put checkpointing into larger jobs, but never really addressed the problem of accurately estimating time to completion. We've been too busy getting rid of early bugs and putting new science modes into Rosetta! Things have settled down, though. The development team will discuss both issues early next week.

Thanks,
Rhiju

http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=11332
"Help Cure Muscular Dystrophy: When the PDB code / Protein Symbols in the 'I' screen left hand bottom change and the 2 proteins in the main graph assume the same colours. Colour changes are pure random, thus could on outside chance assume same colour even if checkpoint was reached. Watch the PDB code change for absolute indication! (See Sample Image and FAQ for description)

Genome Comparison: Approximately every 20 minutes (See Sample Image and FAQ for description)

Help Defeat Cancer: at 25% intervals - writes large files (See Sample Image and FAQ for description)

Human Proteome Folding 2: Occurs after each structure attempt. Look at the graphics, one can see how far along an attempt is. When the 3 line graphs reach the end of the X axis and restart at the left, the structure attempt is complete and a checkpoint occurs (See Sample Image for UD Agent, BOINC Agent and FAQ for description)

FightAIDS@Home: When the Best Energy C graph green line has reached the end and returns to the beginning, whilst rescaling the graph and adding a red line indicating the path of the previous attempt. (See Sample Image and FAQ for description)"

I would imagine that these workloads are each quite different, yet they are all able to save progress in a mindful manner...
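Two of the policies quoted above (a checkpoint roughly every 20 minutes, and a checkpoint at 25% intervals) can be sketched in a few lines of Python. This is only an illustration of the two ideas; the function names and parameters are hypothetical and are not taken from WCG or BOINC code.

```python
def due_by_time(now_s, last_checkpoint_s, interval_s=20 * 60):
    """Time-based policy: checkpoint once interval_s seconds have elapsed."""
    return now_s - last_checkpoint_s >= interval_s


def due_by_progress(fraction_done, last_checkpoint_fraction, step=0.25):
    """Progress-based policy: checkpoint when a new multiple of step is crossed."""
    return int(fraction_done / step) > int(last_checkpoint_fraction / step)


if __name__ == "__main__":
    print(due_by_time(1200, 0))         # 20 minutes since the last checkpoint
    print(due_by_progress(0.49, 0.25))  # still inside the 25%-50% band
    print(due_by_progress(0.5, 0.25))   # crossed the 50% mark
```

A time-based policy bounds how much work can be lost no matter how long a single unit of work takes. A progress-based policy is simpler to implement, but can leave long gaps between saves when progress is slow.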


[STS]LoB

Joined: 18 Mar 07
Posts: 4
Credit: 678,612
RAC: 0
Message 38995 - Posted: 4 Apr 2007, 19:45:50 UTC - in response to Message 38254.  

Hey Rhiju, has Version 5.59 increased the checkpointing frequency? I ask because of the heavily increased rate of updates to the progress display (xx%)...


Hi John, Mod.Sense, and others:

Thanks for bringing this up. More checkpointing and better time-to-completion feedback were big causes of controversy last year (and again now!). We did put checkpointing into larger jobs, but never really addressed the problem of accurately estimating time to completion. We've been too busy getting rid of early bugs and putting new science modes into Rosetta! Things have settled down, though. The development team will discuss both issues early next week.

Thanks,
Rhiju

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 38997 - Posted: 4 Apr 2007, 21:23:25 UTC

Version 5.59 tackled the % complete. Additional checkpoints will be added in the coming weeks. See Rhiju's post on the Ralph boards.
Rosetta Moderator: Mod.Sense
[STS]LoB

Joined: 18 Mar 07
Posts: 4
Credit: 678,612
RAC: 0
Message 38998 - Posted: 4 Apr 2007, 21:25:39 UTC - in response to Message 38997.  

Thanks!

Version 5.59 tackled the % complete. Additional checkpoints will be added in the coming weeks. See Rhiju's post on the Ralph boards.




©2025 University of Washington
https://www.bakerlab.org