Rosetta Checks

Robert Gammon
Joined: 9 Nov 07
Posts: 14
Credit: 969,848
RAC: 0
Message 54702 - Posted: 28 Jul 2008, 16:37:21 UTC

Please forgive my awkward spelling here, as this computer has keyboard problems. The two letters over the space bar will not show up.

If BOINC (see problems again) terminates abnormally (more problems), Rosetta loses track of where it was in the WU and frequently restarts from zero, occasionally repeating half the work already done.

Seti does not appear to have the same issue. Power failures, XP restarts, etc. do not appear to cause any problems with the Seti app. It restarts from the last work as expected 99+% of the time.

The Rosetta moderator says "Suspend the app before exiting BOINC to avoid this known problem." Power failures and XP lockups on this laptop make that almost impossible to do.

ID: 54702 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 54713 - Posted: 29 Jul 2008, 0:40:03 UTC

Robert, you might try copy/paste for the letters you are unable to type.

What Robert is asking about is how exiting BOINC and restarting sometimes causes a task to begin again from the start. And, based on his prior post, he has noticed that if he suspends a task, rather than exiting BOINC, it does not start over again.

Robert, any time Rosetta is ended (not just suspended, but ended), it will have to restart from its last checkpoint. Checkpoints are saved periodically, but some tasks are able to checkpoint more frequently than others. In your case you are seeing as much as an hour or two lost. For tasks with very long-running models and infrequent checkpoints, this is normal. And yes, the Project Team is aware that valuable work is being lost. They are always working to add more checkpoints to the tasks that presently do not checkpoint frequently or, in some cases, only checkpoint after each completed model.

Rosetta doesn't want to grind your hard disk away by writing all the time. That takes time away from crunching. So there is a fine line to walk here between checkpointing too frequently and not frequently enough.
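
To make that trade-off concrete, here is a minimal Python sketch of interval-based checkpointing. It is not Rosetta's actual code: the function names, the JSON state file, and the `CHECKPOINT_INTERVAL` value are all hypothetical and chosen only for illustration. A longer interval means fewer disk writes but more work lost if the program is ended.

```python
import json
import os
import tempfile
import time

CHECKPOINT_INTERVAL = 60.0  # hypothetical: seconds between checkpoint writes


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # If the program dies mid-write, the previous checkpoint is still intact.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"step": 0}


def run(path, total_steps, interval=CHECKPOINT_INTERVAL, clock=time.monotonic):
    """Do total_steps units of work, checkpointing at most once per interval."""
    state = load_checkpoint(path)  # resume from wherever the last run got to
    last_write = clock()
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one unit of real computation
        if clock() - last_write >= interval:
            save_checkpoint(path, state)
            last_write = clock()
    save_checkpoint(path, state)  # always save on completion
    return state
```

If the process is killed between two writes, the next call to `run` picks up from the last saved `step`, so at most one interval of work is repeated.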

All Rosetta tasks checkpoint when a model is completed. You can see this on the website by looking at your results and seeing the number of "decoys" produced. As Rosetta is running on your machine, you can see in the graphic the current model you are working on and get a feel for how frequently a new model is started.

These models are how Rosetta is able to give you the ability to set your own runtime preference (see Rosetta preferences in your profile). It basically just keeps doing more models until your preference is reached. OK, OK, it is a little harder than that: it takes the current average time per model, and if that estimate indicates that doing one more model would exceed your runtime preference by much, it ends things now and reports the task.
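
The stopping rule described here can be sketched in a few lines of Python. This is only an illustration of the idea, not Rosetta's actual code: the function name, the `tolerance_seconds` parameter (standing in for "by much"), and the use of a plain running average are all assumptions.

```python
def should_start_another_model(elapsed_seconds, models_done,
                               runtime_pref_seconds, tolerance_seconds=0.0):
    """Decide whether to begin one more model.

    Assumption: the next model is expected to take the average of the
    models completed so far.
    """
    if models_done == 0:
        # At least one model must finish to have anything to report.
        return True
    average = elapsed_seconds / models_done
    projected = elapsed_seconds + average
    return projected <= runtime_pref_seconds + tolerance_seconds
```

For example, with a 3-hour (10800-second) preference, three models finished in 6000 seconds project to 8000 seconds, so another model is started; five models finished in 10000 seconds project to 12000 seconds, so the task would be reported instead.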

So that is why you see some variation in how long each task takes to run. And you must complete at least one model to have anything useful to report. So, even if your runtime preference is very low (1hr for example) you will see tasks running as long as several hours before completing. It took that long to complete the first model. And BOINC doesn't have a way to know ahead of time that the models will take a long time to run, so it just uses its historical observation of how long work is taking your machine to complete. So, if most tasks have actually completed in an hour, your machine will estimate an hour of runtime. If the task then takes 3 hours to complete the first model, you will note the estimated time remaining gets down to around 10 minutes and then diminishes exponentially slowly from there. So BOINC will estimate there are about 10 minutes left for the last 2 hours of running. Not ideal, but for the 3hr default runtime, and higher, it's generally pretty close.
Rosetta Moderator: Mod.Sense
ID: 54713 · Rating: 0
Robert Gammon
Joined: 9 Nov 07
Posts: 14
Credit: 969,848
RAC: 0
Message 54751 - Posted: 30 Jul 2008, 14:59:32 UTC - in response to Message 54713.  

I have RAC of over 7, so Rosetta has some experience with how much time is required. My losses are usually more like 3 hours, vs the 1 to 2 hours quoted here.

I disagree somewhat with the moderator's comments. If I do an orderly shutdown, Suspending Rosetta, then Exiting BOINC, then Shutdown XP, powerup at a new location, restart BOINC, Resume Rosetta, this SHOULD restart AT or close to the exit, i.e. if we are at 92% complete, it should restart at 92% or very very close to that. Most of the time, it goes to 0.0%.
ID: 54751 · Rating: 0
Robert Gammon
Joined: 9 Nov 07
Posts: 14
Credit: 969,848
RAC: 0
Message 54752 - Posted: 30 Jul 2008, 15:01:03 UTC

Cut and paste assumes that I have a file here that will have the lost keys.

I do not have such a file.
ID: 54752 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,821,902
RAC: 15,180
Message 54754 - Posted: 30 Jul 2008, 15:47:03 UTC - in response to Message 54751.  

I have RAC of over 7, so Rosetta has some experience with how much time is required. My losses are usually more like 3 hours, vs the 1 to 2 hours quoted here.

I disagree somewhat with the moderator's comments. If I do an orderly shutdown, Suspending Rosetta, then Exiting BOINC, then Shutdown XP, powerup at a new location, restart BOINC, Resume Rosetta, this SHOULD restart AT or close to the exit, i.e. if we are at 92% complete, it should restart at 92% or very very close to that. Most of the time, it goes to 0.0%.

I'm not sure what the benefit of suspending Rosetta is. If Rosetta has reached a checkpoint then it will save, and if it hasn't, it won't. Hibernating or Standby will mean that Rosetta resumes from its previous position, regardless of whether it has checkpointed, so might be a better option for you (although you can't use Standby if cutting the power to the computer!)
ID: 54754 · Rating: 0
David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 54755 - Posted: 30 Jul 2008, 16:28:50 UTC - in response to Message 54751.  

I have RAC of over 7, so Rosetta has some experience with how much time is required. My losses are usually more like 3 hours, vs the 1 to 2 hours quoted here.

I disagree somewhat with the moderator's comments. If I do an orderly shutdown, Suspending Rosetta, then Exiting BOINC, then Shutdown XP, powerup at a new location, restart BOINC, Resume Rosetta, this SHOULD restart AT or close to the exit, i.e. if we are at 92% complete, it should restart at 92% or very very close to that. Most of the time, it goes to 0.0%.

{EDIT: Replaced the missing letters.}


Robert,

First of all, I want to thank you for crunching for Rosetta. Problems notwithstanding, I personally believe it to be among the best run and most worthy distributed computing projects.

I reviewed the history of the workunits you have completed. As of the time of this post, ALL of the successful completions report something like this:
# cpu_run_time_pref: 7200
# random seed: 3803289
======================================================
DONE :: 1 starting structures 4646.02 cpu seconds
This process generated 1 decoys from 1 attempts
0 starting pdbs were skipped
======================================================


Two things to note from this example:

1) Your CPU runtime preference is 7200 seconds. The BOINC manager noted that it took 4646.02 seconds to complete one model, so it didn't even start the second, since according to its experience (unique to each task) the completion of the next model would have exceeded your runtime preference.
----------

2) Many of the workunits coming out these days can only checkpoint at the completion of a model. Checkpointing is the only way that Rosetta can save the state of a workunit in a way that will survive the shutdown of the host computer. Suspending a workunit saves its state to the pagefile on your hard drive, but as soon as you shut down, the pagefile is lost.

Since all of the results your computer has been reporting have been "1 decoy from 1 attempt" it is likely that every time you shutdown, Rosetta has to start over again from scratch on that task.
-----------
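
For anyone who wants to check their own results the same way, the figures in the report above can be pulled out with a short Python sketch. The regular expressions are written against the sample shown in this post only; other task types may print a different layout, and the function name is hypothetical.

```python
import re

SAMPLE_LOG = """\
# cpu_run_time_pref: 7200
# random seed: 3803289
======================================================
DONE :: 1 starting structures 4646.02 cpu seconds
This process generated 1 decoys from 1 attempts
0 starting pdbs were skipped
======================================================
"""


def summarize_result(log_text):
    """Extract the runtime preference, CPU time and decoy count from a report."""
    pref = re.search(r"cpu_run_time_pref:\s*(\d+)", log_text)
    done = re.search(
        r"DONE ::\s*\d+ starting structures\s+([\d.]+) cpu seconds", log_text
    )
    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", log_text)
    if not (pref and done and decoys):
        raise ValueError("log text does not match the expected layout")
    cpu_seconds = float(done.group(1))
    decoy_count = int(decoys.group(1))
    return {
        "runtime_pref": int(pref.group(1)),
        "cpu_seconds": cpu_seconds,
        "decoys": decoy_count,
        "attempts": int(decoys.group(2)),
        # Average time per completed model; None if nothing was completed.
        "seconds_per_decoy": cpu_seconds / decoy_count if decoy_count else None,
    }
```

Calling `summarize_result(SAMPLE_LOG)` on the report above gives one decoy at 4646.02 CPU seconds against a 7200-second preference, which is the "1 decoy from 1 attempt" pattern discussed here.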

SETI@home performs a fundamentally different task, orders of magnitude simpler than the tasks performed by Rosetta@home. It should not surprise us that the SETI workunits are more resilient to shutdowns, whether expected and orderly, or otherwise.

As dcdc points out in the preceding post, it may be possible for you to preserve the state of a workunit by using Standby or Hibernate rather than shutting down when you need to relocate.

I hope you will continue to crunch for Rosetta :-)

Respectfully,
David

Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
ID: 54755 · Rating: 0
drghughes
Joined: 27 Apr 07
Posts: 7
Credit: 6,346
RAC: 0
Message 54763 - Posted: 30 Jul 2008, 20:35:24 UTC

Hi Robert!

Have a look at http://boinc.berkeley.edu/wiki/Client_configuration

This is the documentation for the BOINC client configuration file. If you set the <checkpoint_debug> flag to 1 (default is 0), you will at least be able to see when workunits checkpoint.

If you can wait until then before shutting down, you won't lose the work that you've done. If you can't, well at least you'll be making an informed decision.

Note that the configuration file is loaded when BOINC starts, or if you go to Advanced - Read config file, i.e. just creating it isn't enough.
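
As a hedged illustration, the short Python sketch below generates a configuration file with that flag switched on. The element layout (`<cc_config>` containing `<log_flags>` containing `<checkpoint_debug>`) is my understanding of the format described on the linked page, so please check it against that documentation; the function names are hypothetical, and the file still has to be saved as cc_config.xml in your own BOINC data directory.

```python
import xml.etree.ElementTree as ET


def build_cc_config(checkpoint_debug=True):
    """Return the text of a minimal config with the checkpoint_debug flag set."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    flag = ET.SubElement(log_flags, "checkpoint_debug")
    flag.text = "1" if checkpoint_debug else "0"
    return ET.tostring(root, encoding="unicode")


def write_cc_config(path, checkpoint_debug=True):
    """Write the config text to the given path (caller chooses the location)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_cc_config(checkpoint_debug))
```

Writing the file from a script rather than typing it by hand is purely a convenience; the client only picks it up at startup or when told to re-read the config file, as noted above.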

ID: 54763 · Rating: 0
l_mckeon
Joined: 5 Jun 07
Posts: 44
Credit: 180,717
RAC: 0
Message 54765 - Posted: 30 Jul 2008, 23:16:55 UTC - in response to Message 54763.  

Hi Robert!

Have a look at http://boinc.berkeley.edu/wiki/Client_configuration

This is the documentation for the BOINC client configuration file. If you set the <checkpoint_debug> flag to 1 (default is 0), you will at least be able to see when workunits checkpoint.



OR just go into the BOINC/SLOTS program folder. You'll find sub-folders called 0, 1, 2, 3 etc, one per current running Work Unit. The last time that each SLOTS sub-folder was modified is the time of the last check point.

I'm not sure if that applies when the BOINC Manager is switching between WUs from different projects.
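
The same check can be scripted. Below is a minimal Python sketch, assuming the numbered sub-folder layout described above; the function names are hypothetical and the path to the slots folder depends on your own install. One caveat: a folder's modification time normally changes only when files inside it are created, deleted or renamed, so treat the result as a rough indicator rather than a guaranteed checkpoint time.

```python
import os
import time


def last_modified_by_slot(slots_dir):
    """Map each numbered slot sub-folder to its modification time (epoch seconds)."""
    result = {}
    for entry in os.scandir(slots_dir):
        # Only the numbered sub-folders (0, 1, 2, ...) are of interest.
        if entry.is_dir() and entry.name.isdigit():
            result[int(entry.name)] = entry.stat().st_mtime
    return result


def format_report(mtimes, now=None):
    """Turn the mapping into one readable line per slot, in slot order."""
    now = time.time() if now is None else now
    lines = []
    for slot in sorted(mtimes):
        minutes = (now - mtimes[slot]) / 60.0
        lines.append(f"slot {slot}: modified {minutes:.1f} minutes ago")
    return lines
```

To use it, pass the path of your own slots folder to `last_modified_by_slot` and print each line returned by `format_report`.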

I have an old freeware file manager program called EF Commander that I keep permanently pointed at this location for just this purpose, although of course Windows Explorer will do.

Having said that, is it really necessary for the WU to run as just one model with no check points or is that just poor/lazy design?

ID: 54765 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 54769 - Posted: 31 Jul 2008, 2:21:14 UTC

Different types of tasks have different abilities to take checkpoints. Over time, they are always working towards adding more checkpoints to tasks. Especially to tasks that do not presently have any. However, they are also adding over time more new types of tasks. So, each type actually has to endure and mature a little, before it becomes clear whether it is going to show promising results. Only then can you know if it is worth further refinement and investment to add further coding to take checkpoints.

So, I couldn't find any of your adjectives that I could agree with. I see it as a natural evolution. Many of the types of tasks already checkpoint frequently. However, CASP is just wrapping up, and CASP proteins and techniques are the cutting-edge stuff proving itself. So, this Summer season of crunching has perhaps shown more of the tasks that do not checkpoint much than you would see normally.
Rosetta Moderator: Mod.Sense
ID: 54769 · Rating: 0
Robert Gammon
Joined: 9 Nov 07
Posts: 14
Credit: 969,848
RAC: 0
Message 54819 - Posted: 2 Aug 2008, 1:36:00 UTC - in response to Message 54769.  

Different types of tasks have different abilities to take checkpoints. Over time, they are always working towards adding more checkpoints to tasks. Especially to tasks that do not presently have any. However, they are also adding over time more new types of tasks. So, each type actually has to endure and mature a little, before it becomes clear whether it is going to show promising results. Only then can you know if it is worth further refinement and investment to add further coding to take checkpoints.


So the message REALLY is, DO NOT SHUTDOWN A COMPUTER THAT IS WORKING ON ROSETTA. TO LOSE POWER IN ANY FORM OR FASHION, AND THE LIKELIHOOD IS VERY VERY HIGH THAT YOU WILL LOSE ALL WORK ON THAT WORKUNIT!!!

The laptop in question has no battery. In this respect, it behaves more like a desktop computer that does not run 24x7. Other BOINC apps may or may not have the same characteristic. Seti work units are roughly 10x bigger, that is take roughly 10x longer to process on the same hardware than Rosetta work units and they have always been long work units to process. Checkpointing keeps us from losing Seti work, but not Rosetta.

And since the laptop cannot type 'n' or 'b', creating a file named cc_config.xml is quite impossible.

Should be able to get a new one tomorrow night.
ID: 54819 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 54824 - Posted: 2 Aug 2008, 8:26:43 UTC

So the message REALLY is, DO NOT SHUTDOWN A COMPUTER THAT IS WORKING ON ROSETTA. TO LOSE POWER IN ANY FORM OR FASHION, AND THE LIKELIHOOD IS VERY VERY HIGH THAT YOU WILL LOSE ALL WORK ON THAT WORKUNIT!!!


I think that is being a little too dramatic.
Each time I shut down the computer for whatever reason, I am always surprised how little time I have lost. With the units I have been given the lost time is usually less than 10 minutes.

To get around your key problem, plug in an old keyboard. It will solve your problem and also make typing easier.
ID: 54824 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,821,902
RAC: 15,180
Message 54827 - Posted: 2 Aug 2008, 11:07:03 UTC - in response to Message 54819.  

So the message REALLY is, DO NOT SHUTDOWN A COMPUTER THAT IS WORKING ON ROSETTA. TO LOSE POWER IN ANY FORM OR FASHION, AND THE LIKELIHOOD IS VERY VERY HIGH THAT YOU WILL LOSE ALL WORK ON THAT WORKUNIT!!!

The laptop in question has no battery. In this respect, it behaves more like a desktop computer that does not run 24x7. Other BOINC apps may or may not have the same characteristic. Seti work units are roughly 10x bigger, that is take roughly 10x longer to process on the same hardware than Rosetta work units and they have always been long work units to process. Checkpointing keeps us from losing Seti work, but not Rosetta.

And since the laptop cannot type 'n' or 'b', creating a file named cc_config.xml is quite impossible.

Should be able to get a new one tomorrow night.

If the guys at the Bakerlab could improve the checkpointing then I'm sure they would - it's in their best interests! Unfortunately it might well be that it's just not practical without writing everything in memory to disk, which would have all kinds of implications for disk usage and shutdown times etc. Maybe it's an option that could be added... but without that it's an unfortunate aspect of the research being done.

Ironically, faster computers are less affected by this problem within a given timeframe because they will reach checkpoints (even if that means they've had to complete a decoy to reach one) more quickly and therefore lose less computer time (well wall-time anyway). Also, I'm fairly sure many of the tasks do still checkpoint within a model, so you won't necessarily lose all of the work on a given decoy if you restart.
ID: 54827 · Rating: 0
David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 54833 - Posted: 2 Aug 2008, 14:07:51 UTC - in response to Message 54819.  
Last modified: 2 Aug 2008, 14:09:21 UTC

{...}
So the message REALLY is, DO NOT SHUTDOWN A COMPUTER THAT IS WORKING ON ROSETTA. TO LOSE POWER IN ANY FORM OR FASHION, AND THE LIKELIHOOD IS VERY VERY HIGH THAT YOU WILL LOSE ALL WORK ON THAT WORKUNIT!!!
{...}


In a word: no.

At the time of this posting, the last nine workunits reported by your computer all consisted of one model. The average time to complete that one model was 5999 seconds (about an hour and forty minutes).

For that computer, if you leave it on for less than an hour and forty minutes at a stretch, there is a chance you will lose all the work.
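
A small worked example of that arithmetic in Python, assuming the task checkpoints only when a model completes and that every model takes the 5999-second average (both are simplifications, and the function name is hypothetical):

```python
def work_preserved_and_lost(session_seconds, model_seconds):
    """Split one uninterrupted session into work kept and work lost.

    Assumption: a checkpoint is written only when a model completes, so any
    partially finished model is lost when the computer shuts down.
    """
    completed_models = int(session_seconds // model_seconds)
    preserved = completed_models * model_seconds
    lost = session_seconds - preserved
    return completed_models, preserved, lost
```

Under those assumptions, a 90-minute session, `work_preserved_and_lost(5400, 5999)`, returns `(0, 5400 - 5400, 5400)`, that is, no completed model and all 5400 seconds lost, while a two-hour session, `work_preserved_and_lost(7200, 5999)`, keeps one model (5999 seconds) and loses the remaining 1201 seconds.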

{...}
Should be able to get a new one tomorrow night.


If you change computers your results will likely change also. I hope that the new computer will allow you to work more comfortably and efficiently, and that (if you choose to continue) it will allow you to complete Rosetta workunits with less frustration.
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
ID: 54833 · Rating: 0
Robert Gammon
Joined: 9 Nov 07
Posts: 14
Credit: 969,848
RAC: 0
Message 54935 - Posted: 5 Aug 2008, 23:37:52 UTC - in response to Message 54824.  

So the message REALLY is, DO NOT SHUTDOWN A COMPUTER THAT IS WORKING ON ROSETTA. TO LOSE POWER IN ANY FORM OR FASHION, AND THE LIKELIHOOD IS VERY VERY HIGH THAT YOU WILL LOSE ALL WORK ON THAT WORKUNIT!!!


I think that is being a little too dramatic.
Each time I shut down the computer for whatever reason, I am always surprised how little time I have lost. With the units I have been given the lost time is usually less than 10 minutes.

To get around your key problem, plug in an old keyboard. It will solve your problem and also make typing easier.


===========================================
My losses have measured as much as 4 hours; 1/6th of an hour would be great if I could get that.

An old keyboard is not available, and I have no transport or cash to buy one.
ID: 54935 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 54941 - Posted: 6 Aug 2008, 1:56:42 UTC

Robert, I realize that there are times that 4 hours pass without a checkpoint to preserve the work. I just wanted you, and others that read this, to realize that there are a number of factors involved, and that the majority of the time, Rosetta does significantly better at checkpointing.
Rosetta Moderator: Mod.Sense
ID: 54941 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 54989 - Posted: 7 Aug 2008, 22:39:50 UTC

An old keyboard is not available, and I have no transport or cash to buy one.

Another way around, though not very convenient, is to use the on-screen keyboard. It is a bit slower, as you have to use the mouse pointer. You can find it in programs > accessories > accessibility.
ID: 54989 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org