Message boards : Number crunching : Rosetta Checks
Author | Message |
---|---|
Robert Gammon Joined: 9 Nov 07 Posts: 14 Credit: 969,848 RAC: 0 |
Please forgive my awkward spelling here, as this computer has keyboard problems. The two letters over the space bar will not show up. If BOINC (see, problems again) terminates abnormally (more problems), Rosetta loses track of where it was in the WU and frequently restarts from zero, occasionally repeating half the work already done. Seti does not appear to have the same issue. Power failures, XP restarts, etc. do not appear to cause any problems with the Seti app. It restarts from the last work as expected 99+% of the time. The Rosetta moderator says "Suspend the app before exiting BOINC to avoid this known problem". Power failure and XP lockup on this laptop make that almost impossible to do. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Robert, you might try copy/paste for the letters you are unable to type. What Robert is asking about is how exiting BOINC and restarting sometimes causes a task to begin again from the start. And, based on his prior post, he has noticed that if he suspends a task, rather than exiting BOINC, it does not start over again. Robert, any time Rosetta is ended (not just suspended, but ended) it will have to restart from its last checkpoint. Checkpoints are saved periodically, but some tasks are able to checkpoint more frequently than others. In your case you are seeing as much as an hour or two lost. For tasks with very long-running models and infrequent checkpoints, this is normal. And yes, the Project Team is aware that valuable work is being lost. They are always working to add more checkpoints to the tasks that presently do not checkpoint frequently or, in some cases, only checkpoint after each completed model. Rosetta doesn't want to grind your hard disk away by writing all the time. That takes time away from crunching. So there is a fine line to walk here between checkpointing too frequently and not frequently enough. All Rosetta tasks checkpoint when a model is completed. You can see this on the website by looking at your results and seeing the number of "decoys" produced. As Rosetta is running on your machine, you can see in the graphic the current model you are working on and get a feel for how frequently a new model is started. These models are how Rosetta is able to give you the ability to set your own runtime preference (see Rosetta preferences in your profile). They basically just keep doing more models until your preference is reached. OK, OK, it is a little harder than that... it takes the current average time per model, and if that estimate indicates that doing one more model would exceed your runtime preference by much, it ends things there and reports the task.
So that is why you see some variation in how long each task takes to run. And you must complete at least one model to have anything useful to report. So, even if your runtime preference is very low (1hr for example) you may see tasks running as long as several hours before completing, because it took that long to complete the first model. And BOINC doesn't have a way to know ahead of time that the models will take a long time to run, so it just uses its historical observation of how long work is taking your machine to complete. So, if most tasks have actually completed in an hour, your machine will estimate an hour of runtime. If the task then takes 3 hours to complete the first model, you will note the estimated time remaining gets down to around 10 minutes and then shrinks ever more slowly from there. So BOINC will estimate there are about 10 minutes left for the last 2 hours of running. Not ideal, but for the 3hr default runtime, and higher, it's generally pretty close. Rosetta Moderator: Mod.Sense |
Robert Gammon Joined: 9 Nov 07 Posts: 14 Credit: 969,848 RAC: 0 |
"Robert, you might try copy/paste for the letters you are unable to type." I have RAC of over 7, so Rosetta has some experience with how much time is required. My losses are usually more like 3 hours, vs. the 1 to 2 hours quoted here. I disagree somewhat with the moderator's comments. If I do an orderly shutdown (suspending Rosetta, then exiting BOINC, then shutting down XP), power up at a new location, restart BOINC, and resume Rosetta, this SHOULD restart AT or close to the exit point, i.e. if we are at 92% complete, it should restart at 92% or very, very close to that. Most of the time, it goes to 0.0%. |
Robert Gammon Joined: 9 Nov 07 Posts: 14 Credit: 969,848 RAC: 0 |
Cut and paste assumes that I have a file here that contains the lost keys. I do not have such a file. |
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,677,569 RAC: 10,479 |
"Robert, you might try copy/paste for the letters you are unable to type." I'm not sure what the benefit of suspending Rosetta is. If Rosetta has reached a checkpoint then it will save, and if it hasn't, it won't. Hibernating or Standby will mean that Rosetta resumes from its previous position, regardless of whether it has checkpointed, so that might be a better option for you (although you can't use Standby if you are cutting the power to the computer!) |
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
"I have RAC of over 7, so Rosetta has some experience with how much time is required. My losses are usually more like 3 hours, vs the 1 to 2 hours quoted here." Robert, first of all, I want to thank you for crunching for Rosetta. Problems notwithstanding, I personally believe it to be among the best-run and most worthy distributed computing projects. I reviewed the history of the workunits you have completed. As of the time of this post, ALL of the successful completions report something like this: # cpu_run_time_pref: 7200 Two things to note from this example: 1) Your CPU runtime preference is 7200 seconds. The BOINC manager noted that it took 4646.2 seconds to complete one model, so it didn't even start the second, since according to its experience (unique to each task) the completion of the next model would have exceeded your runtime preference. ---------- 2) Many of the workunits coming out these days can only checkpoint at the completion of a model. Checkpointing is the only way that Rosetta can save the state of a workunit in a way that will survive the shutdown of the host computer. Suspending a workunit saves its state to the pagefile on your hard drive, but as soon as you shut down, the pagefile is lost. Since all of the results your computer has been reporting have been "1 decoy from 1 attempt", it is likely that every time you shut down, Rosetta has to start over again from scratch on that task. ----------- SETI@home performs a fundamentally different task, orders of magnitude simpler than the tasks performed by Rosetta@home. It should not surprise us that the SETI workunits are more resilient to shutdowns, whether expected and orderly, or otherwise. As dcdc points out in the preceding post, it may be possible for you to preserve the state of a workunit by using Standby or Hibernate rather than shutting down when you need to relocate.
I hope you will continue to crunch for Rosetta :-) Respectfully, David Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
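
The stopping rule Mod.Sense describes, applied to the numbers David quotes, can be sketched in a few lines of Python. This is an illustration only, not Rosetta's actual code: the function name `should_start_another_model` and the `slack_s` tolerance (standing in for "by much") are assumptions made for this example.

```python
def should_start_another_model(elapsed_s, models_done, runtime_pref_s, slack_s=0.0):
    """Hypothetical version of the rule: start another model only if the
    average time per model suggests it will finish within the preference."""
    if models_done == 0:
        # At least one model must complete before anything can be reported.
        return True
    avg_model_s = elapsed_s / models_done
    projected_finish_s = elapsed_s + avg_model_s
    return projected_finish_s <= runtime_pref_s + slack_s


# David's example: one model took 4646.2 s against a 7200 s preference.
# A second model would be projected to finish at 9292.4 s, so none is started.
print(should_start_another_model(4646.2, 1, 7200))
```

With a single long model per task, the only checkpoint is the end of that model, which is consistent with the "1 decoy from 1 attempt" results described above.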
drghughes Joined: 27 Apr 07 Posts: 7 Credit: 6,346 RAC: 0 |
Hi Robert! Have a look at http://boinc.berkeley.edu/wiki/Client_configuration This is the documentation for the BOINC client configuration file. If you set the <checkpoint_debug> flag to 1 (default is 0), you will at least be able to see when workunits checkpoint. If you can wait until then before shutting down, you won't lose the work that you've done. If you can't, well at least you'll be making an informed decision. Note that the configuration file is loaded when BOINC starts, or if you go to Advanced - Read config file, i.e. just creating it isn't enough. |
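
For anyone who would rather generate the file than type it, here is a minimal Python sketch that builds a `cc_config.xml` with the `<checkpoint_debug>` flag turned on. It assumes the flag sits inside a `<log_flags>` section of `<cc_config>`, as laid out in the documentation linked above; the helper name `build_cc_config` is made up for this example, so check the output against that page before relying on it.

```python
import xml.etree.ElementTree as ET


def build_cc_config(checkpoint_debug=True):
    """Return the text of a cc_config.xml that sets only checkpoint_debug.

    Assumed layout: <cc_config><log_flags><checkpoint_debug>1</...>.
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    flag = ET.SubElement(log_flags, "checkpoint_debug")
    flag.text = "1" if checkpoint_debug else "0"
    return ET.tostring(root, encoding="unicode")


# Print the XML so it can be reviewed before being saved as cc_config.xml.
print(build_cc_config())
```

The printed text would then be saved as `cc_config.xml` where the BOINC client looks for it (the BOINC data directory, per the linked documentation) and picked up on the next BOINC start or via Advanced - Read config file.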
l_mckeon Joined: 5 Jun 07 Posts: 44 Credit: 180,717 RAC: 0 |
"Hi Robert!" Or just go into the BOINC/SLOTS program folder. You'll find sub-folders called 0, 1, 2, 3 etc., one per currently running work unit. The last time each SLOTS sub-folder was modified is the time of the last checkpoint. I'm not sure if that applies when the BOINC manager is switching between WUs from different projects. I have an old freeware file manager program called EF Commander that I keep permanently pointed at this location for just this purpose, although of course Windows Explorer will do. Having said that, is it really necessary for the WU to run as just one model with no checkpoints, or is that just poor/lazy design? |
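
The folder check l_mckeon describes can also be scripted. The sketch below is a rough Python equivalent under stated assumptions: the location of the slots folder varies by installation, so it is passed in as an argument, and the function name `slot_last_modified` is invented for this example. It reads only each sub-folder's own modification time, which is the heuristic from the post and not a guaranteed checkpoint indicator.

```python
import time
from pathlib import Path


def slot_last_modified(slots_dir):
    """Map each slot sub-folder name (0, 1, 2, ...) to its modification time,
    expressed in seconds since the epoch."""
    times = {}
    for slot in sorted(Path(slots_dir).iterdir()):
        if slot.is_dir():
            times[slot.name] = slot.stat().st_mtime
    return times


def describe(slots_dir):
    """Return one human-readable line per slot, e.g. for printing before a shutdown."""
    now = time.time()
    return [
        f"slot {name}: folder last modified {(now - mtime) / 60:.1f} minutes ago"
        for name, mtime in slot_last_modified(slots_dir).items()
    ]
```

Calling `describe()` with the path to your own slots folder and printing the result gives a quick view of how long ago each slot folder last changed.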
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Different types of tasks have different abilities to take checkpoints. Over time, they are always working towards adding more checkpoints to tasks, especially to tasks that do not presently have any. However, they are also adding more new types of tasks over time. So, each type actually has to endure and mature a little before it becomes clear whether it is going to show promising results. Only then can you know if it is worth further refinement and investment to add further coding to take checkpoints. So, I couldn't find any of your adjectives that I could agree with. I see it as a natural evolution. Many of the types of tasks already checkpoint frequently. However, CASP is just wrapping up, and CASP proteins and techniques are the cutting-edge stuff proving itself. So, this Summer season of crunching has perhaps shown more of the tasks that do not checkpoint much than you would normally see. Rosetta Moderator: Mod.Sense |
Robert Gammon Joined: 9 Nov 07 Posts: 14 Credit: 969,848 RAC: 0 |
"Different types of tasks have different abilities to take checkpoints. Over time, they are always working towards adding more checkpoints to tasks. Especially to tasks that do not presently have any. However, they are also adding over time more new types of tasks. So, each type actually has to endure and mature a little, before it becomes clear whether it is going to show promising results. Only then can you know if it is worth further refinement and investment to add further coding to take checkpoints." So the message REALLY is, DO NOT SHUT DOWN A COMPUTER THAT IS WORKING ON ROSETTA. LOSE POWER IN ANY FORM OR FASHION, AND THE LIKELIHOOD IS VERY, VERY HIGH THAT YOU WILL LOSE ALL WORK ON THAT WORKUNIT!!! The laptop in question has no battery. In this respect, it behaves more like a desktop computer that does not run 24x7. Other BOINC apps may or may not have the same characteristic. Seti work units are roughly 10x bigger, that is, they take roughly 10x longer to process on the same hardware than Rosetta work units, and they have always been long work units to process. Checkpointing keeps us from losing Seti work, but not Rosetta work. And since the laptop cannot type 'n' or 'b', creating a file named cc_config.xml is quite impossible. Should be able to get a new one tomorrow night. |
Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
"So the message REALLY is, DO NOT SHUT DOWN A COMPUTER THAT IS WORKING ON ROSETTA. LOSE POWER IN ANY FORM OR FASHION, AND THE LIKELIHOOD IS VERY, VERY HIGH THAT YOU WILL LOSE ALL WORK ON THAT WORKUNIT!!!" I think that is being a little too dramatic. Each time I shut down the computer for whatever reason, I am always surprised how little time I have lost. With the units I have been given, the lost time is usually less than 10 minutes. To get around your key problem, plug in an old keyboard. It will solve your problem and also make typing easier. |
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,677,569 RAC: 10,479 |
"So the message REALLY is, DO NOT SHUT DOWN A COMPUTER THAT IS WORKING ON ROSETTA. LOSE POWER IN ANY FORM OR FASHION, AND THE LIKELIHOOD IS VERY, VERY HIGH THAT YOU WILL LOSE ALL WORK ON THAT WORKUNIT!!!" If the guys at the Bakerlab could improve the checkpointing then I'm sure they would - it's in their best interests! Unfortunately it might well be that it's just not practical without writing everything in memory to disk, which would have all kinds of implications for disk usage, shutdown times, etc. Maybe it's an option that could be added... but without that it's an unfortunate aspect of the research being done. Ironically, faster computers are less affected by this problem within a given timeframe because they will reach checkpoints (even if that means they've had to complete a decoy to reach one) more quickly and therefore lose less computer time (well, wall-time anyway). Also, I'm fairly sure many of the tasks do still checkpoint within a model, so you won't necessarily lose all of the work on a given decoy if you restart. |
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
{...} In a word: no. At the time of this posting, the last nine workunits reported by your computer all consisted of one model. The average time to complete that one model was 5999 seconds (about an hour and forty minutes). For that computer, if you leave it on for less than an hour and forty minutes at a stretch, there is a chance you will lose all the work. {...} If you change computers your results will likely change also. I hope that the new computer will allow you to work more comfortably and efficiently, and that (if you choose to continue) it will allow you to complete Rosetta workunits with less frustration. Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
Robert Gammon Joined: 9 Nov 07 Posts: 14 Credit: 969,848 RAC: 0 |
"So the message REALLY is, DO NOT SHUT DOWN A COMPUTER THAT IS WORKING ON ROSETTA. LOSE POWER IN ANY FORM OR FASHION, AND THE LIKELIHOOD IS VERY, VERY HIGH THAT YOU WILL LOSE ALL WORK ON THAT WORKUNIT!!!" =========================================== My losses have measured as much as 4 hours; 1/6th of an hour would be great if I could get that. An old keyboard is not available, and I have no transport or cash to buy one. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Robert, I realize that there are times that 4 hours pass without a checkpoint to preserve the work. I just wanted you, and others that read this, to realize that there are a number of factors involved, and that the majority of the time, Rosetta does significantly better at checkpointing. Rosetta Moderator: Mod.Sense |
Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
"An old keyboard is not available, and I have no transport or cash to buy one." Another way around it, though not a very convenient one, is to use the on-screen keyboard. It is a bit slower, as you have to use the mouse pointer. You can find it in Programs > Accessories > Accessibility. |
©2024 University of Washington
https://www.bakerlab.org