Checkpointing needs fixing



Ron Peterson

Joined: 6 Oct 05
Posts: 23
Credit: 4,268,694
RAC: 0
Message 39632 - Posted: 20 Apr 2007, 0:19:30 UTC
Last modified: 20 Apr 2007, 0:37:44 UTC

I don't run applications nonstop on my home computer. Yes, I pause and restart between different projects. But every time I get above about 20%, with 10 to 15 hours done, and pause, then restart, I lose all work, going back to 0%. I want the old checkpointing system back.

Ron

edit - It's only been a few weeks like this.

edit2 - And yes, I leave the application in memory
anders n

Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 39633 - Posted: 20 Apr 2007, 5:12:41 UTC

Hi Ron

If you check your WU after it has restarted, does the CPU time reset to 0?

And if you check the graphics, what model does it start at?

Anders n

Ron Peterson

Joined: 6 Oct 05
Posts: 23
Credit: 4,268,694
RAC: 0
Message 39639 - Posted: 20 Apr 2007, 8:44:49 UTC - in response to Message 39633.  

No, just the progress goes to 0%. And no, I've not checked graphics.
anders n

Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 39644 - Posted: 20 Apr 2007, 11:57:52 UTC

You don't lose more work now than you did before; it's just the % complete
that is off when you restart BOINC.

Hope they find a fix for it soon.

Happy crunching

Anders n
Ron Peterson

Joined: 6 Oct 05
Posts: 23
Credit: 4,268,694
RAC: 0
Message 39646 - Posted: 20 Apr 2007, 13:41:03 UTC

*grumble* After getting to about 12% it restarted, yet again. Graphics show it in the "ab initio + relax" start up stage.

Is it worth running Rosetta at all until this is fixed?
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 39653 - Posted: 20 Apr 2007, 15:11:13 UTC

Ron, I'd say your results speak for themselves. The only ones that show any sign of error are the ones that were aborted by user.

Anders n is correct: the only thing that's really changed here is that the % complete is updated more frequently. It used to update only at the end of each model. Now it updates every 5 seconds. That change brought a new quirk: after a full restart of a task (i.e. it was removed from memory), the % complete starts at zero, even though many hours of work may already be retained in the task. These tasks will still finish at the normal time. Only the indicator of how far along the task is shows the wrong value.

...and yes, I've been saying many many times in threads throughout the boards that the checkpointing is the next big thing (from a user's point of view) that we will see in the next Rosetta release.
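
A rough way to picture the quirk described above, as a Python sketch. The class and field names are hypothetical and are not taken from Rosetta or BOINC. The sketch assumes only that the displayed percentage lives in memory while the CPU time of retained work is stored in a checkpoint, and it simplifies by treating all work as checkpointed immediately.

```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    """Hypothetical model of a task; not Rosetta or BOINC code."""
    target_cpu_seconds: float              # assumed run time preference
    checkpointed_cpu_seconds: float = 0.0  # survives a full restart
    displayed_fraction: float = 0.0        # kept in memory only

    def run(self, cpu_seconds: float) -> None:
        # Simplification: work is checkpointed as soon as it is done,
        # and the display is refreshed from the saved CPU time.
        self.checkpointed_cpu_seconds += cpu_seconds
        self.displayed_fraction = min(
            1.0, self.checkpointed_cpu_seconds / self.target_cpu_seconds
        )

    def full_restart(self) -> None:
        # The task was removed from memory: the display starts over,
        # but the checkpointed work is untouched.
        self.displayed_fraction = 0.0

    def remaining_cpu_seconds(self) -> float:
        return max(0.0, self.target_cpu_seconds - self.checkpointed_cpu_seconds)


task = ToyTask(target_cpu_seconds=24 * 3600)
task.run(6 * 3600)
remaining_before = task.remaining_cpu_seconds()
task.full_restart()
remaining_after = task.remaining_cpu_seconds()
```

In this toy model the displayed fraction drops to zero after the restart while the remaining CPU time stays the same, which matches the behavior reported in this thread.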
Rosetta Moderator: Mod.Sense
Ron Peterson

Joined: 6 Oct 05
Posts: 23
Credit: 4,268,694
RAC: 0
Message 39667 - Posted: 20 Apr 2007, 16:07:26 UTC - in response to Message 39653.  

I'll see what this run does. 15 hours and 17% done after two restarts. Yes, I aborted earlier ones because it looked like they had lost progress.
Purple Rabbit

Joined: 24 Sep 05
Posts: 28
Credit: 4,296,740
RAC: 3,006
Message 39668 - Posted: 20 Apr 2007, 16:09:14 UTC

I think one of the unintended consequences of the new user-friendly progress bar is that you don't know whether a model has completed. I used to wait for 10%, 20%, etc. before doing anything drastic with BOINC. Now these points don't necessarily seem to correspond to the end of a model. My anecdotal and unscientific observations suggest that I may have thrown away some work because the progress is only updated to reality at the end of each model.

This isn't a complaint. It's an observation. It may also be an example of: "You can't please everyone no matter what you do" :-)
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 39677 - Posted: 20 Apr 2007, 20:59:51 UTC

It looks like there is a bug in the % complete logic. It should jump back to the last checkpointed value after a restart. We'll fix this for the next release. You can't really keep track of the models via the BOINC Manager, but the Rosetta graphics show the model number. The % progress and time to complete are estimates because of the variable run times per model and the per-model time resolution used to avoid exceeding the run time preference.
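
The intended behavior, jumping back to the last checkpointed value after a restart, might look roughly like the Python sketch below. The function names, and the idea of estimating progress as checkpointed CPU time divided by the target run time, are assumptions made for illustration and are not the project's actual logic. The last function sketches what a per-model resolution could mean: another model is started only if it is expected to fit within the run time preference.

```python
def fraction_done(cpu_seconds: float, target_cpu_seconds: float) -> float:
    """Estimate progress as CPU time used over target run time (assumed formula)."""
    if target_cpu_seconds <= 0:
        raise ValueError("target_cpu_seconds must be positive")
    return min(1.0, max(0.0, cpu_seconds / target_cpu_seconds))


def fraction_after_restart(checkpoint_cpu_seconds: float,
                           target_cpu_seconds: float) -> float:
    # Seed the display from the last checkpoint instead of from zero.
    return fraction_done(checkpoint_cpu_seconds, target_cpu_seconds)


def should_start_next_model(cpu_seconds: float,
                            avg_model_seconds: float,
                            target_cpu_seconds: float) -> bool:
    # Start another model only if its expected run time still fits
    # within the run time preference (hypothetical stopping rule).
    return cpu_seconds + avg_model_seconds <= target_cpu_seconds
```

Because model run times vary, any average used in such a rule is itself an estimate, which is one reason the percentage and the time to completion can only be approximate.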
Purple Rabbit

Joined: 24 Sep 05
Posts: 28
Credit: 4,296,740
RAC: 3,006
Message 39727 - Posted: 22 Apr 2007, 15:27:04 UTC - in response to Message 39677.  
Last modified: 22 Apr 2007, 15:56:06 UTC

You can't really keep track of the models via the boinc manager but the rosetta graphics has the model number.

Well yes, but most of my computers are remote Linux machines... sigh. It's a PITA, but certainly not a showstopper. The BOINC progress report is all I have, so any efforts to make it more accurate with respect to model completion will be appreciated (but you knew that already!). I only mention this as constructive criticism so you know the problem I have, and with the hope that the next version will be better.




©2024 University of Washington
https://www.bakerlab.org