No restart points in minirosetta 3.46?

Message boards : Number crunching : No restart points in minirosetta 3.46?

Wissi

Joined: 19 Nov 08
Posts: 14
Credit: 485,807
RAC: 0
Message 76080 - Posted: 1 Oct 2013, 8:51:56 UTC

Hi all,

I'm quite upset that there seem to be no restart points (or whatever these points are called) in minirosetta 3.46.

I do the number crunching on my office laptop and on my home PC. On the office laptop I get tasks that have an estimated run time of 7 hours. Since I also run other projects in BOINC besides rosetta@home, you can easily figure out that the effective run time is far longer than 7 hours.

Now when I finish my work after 8 or 9 hours, I shut down the office laptop. As a result, all the hours spent on the minirosetta calculation are wasted, because the task still has an estimated 2 or 3 hours left. The next day, the task starts again from scratch, so it can never be completed.

Hey, what's the problem here? I remember that this was already discussed 2 or 3 YEARS ago, and nothing has happened? Isn't it possible to define some points where an intermediate result is saved, and from where a restart is possible? If not, then please make the WUs smaller. An estimated run time of 7 hours without any intermediate save point does not make any sense!

Look at Einstein@home, there it works. They sometimes have tasks that have an estimated time between 8 and 12 hours, and there, a restart is no problem. Maybe 1 hour may be lost, but there are improvements.

To be honest: I don't want to waste the energy and computer power for nothing.


ID: 76080 · Rating: 0
Polian

Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 76083 - Posted: 1 Oct 2013, 16:35:58 UTC

Change your target runtime preference to something shorter, like three hours.
ID: 76083 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 76084 - Posted: 1 Oct 2013, 18:12:33 UTC

The target runtime preference will not change when checkpoints are taken. The task's remaining runtime is not an indication of when the last checkpoint was taken either. But, if the task restarts at zero, that's a clear indication that a checkpoint has not yet been reached.

Rosetta tasks store their work (i.e. "checkpoint") at the end of each model. Some tasks compute models with less CPU time than others. Some tasks checkpoint more frequently than each model.
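To make the "checkpoint at the end of each model" idea concrete, here is a minimal Python sketch. It is not Rosetta's actual code; the function `run_task`, the JSON checkpoint file, and the `stop_after` parameter (which simulates a shutdown) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def run_task(num_models, checkpoint_path, compute_model, stop_after=None):
    """Compute models one by one, saving progress after each finished model."""
    # Resume from the last checkpoint if one exists; otherwise start at zero.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)
    else:
        results = []

    computed_now = 0
    for i in range(len(results), num_models):
        if stop_after is not None and computed_now >= stop_after:
            break  # simulate the machine being shut down
        results.append(compute_model(i))
        computed_now += 1
        # Write to a temporary file and rename it, so an interruption
        # during the write cannot leave a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(results, f)
        os.replace(tmp_path, checkpoint_path)
    return results


# Hypothetical usage: the first session is interrupted after two models,
# the second session picks up at model 2 instead of starting over.
workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "checkpoint.json")
first = run_task(5, ckpt, lambda i: i * i, stop_after=2)
second = run_task(5, ckpt, lambda i: i * i)
print(first, second)
```

In this scheme only the model in progress is lost on shutdown. If a single model takes longer than one session, nothing is ever saved and the task restarts from zero each time.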

I run my laptop much as you describe yours. I use the sleep function to power it down for the day and when I bring it back up, the work continues. I've heard others indicate that their machine seems to only work as desired if they hibernate rather than sleep.
Rosetta Moderator: Mod.Sense
ID: 76084 · Rating: 0
Wissi

Joined: 19 Nov 08
Posts: 14
Credit: 485,807
RAC: 0
Message 76085 - Posted: 1 Oct 2013, 22:00:31 UTC - in response to Message 76084.  

The target runtime preference will not change when checkpoints are taken. The task's remaining runtime is not an indication of when the last checkpoint was taken either. But, if the task restarts at zero, that's a clear indication that a checkpoint has not yet been reached.


For me, this clearly indicates that NO checkpoints are used. I ran an experiment at home with the rosetta tasks, and even when I shut down BOINC only a few minutes before the end of a rosetta task, the task restarts at ZERO the next time BOINC is started.

So what's going on here??
ID: 76085 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 76086 - Posted: 1 Oct 2013, 23:11:39 UTC

You could just look at the properties of the task, and see the current CPU time as compared to the CPU time at the last checkpoint.
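As a rough illustration of that comparison, the sketch below computes how much CPU time would be lost if the task were stopped now. The helper names are made up for this example, the time strings are hypothetical values typed in by hand from the task properties dialog, and a row of dashes is assumed to mean that no checkpoint has been written yet.

```python
def parse_hms(text):
    """Convert 'HH:MM:SS' to seconds; return None for a placeholder such as '-----'."""
    parts = text.strip().split(":")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    hours, minutes, seconds = (int(p) for p in parts)
    return hours * 3600 + minutes * 60 + seconds


def cpu_seconds_at_risk(cpu_time, cpu_time_at_checkpoint):
    """CPU seconds that would be lost if the task were stopped right now."""
    current = parse_hms(cpu_time)
    if current is None:
        raise ValueError("cpu_time must look like HH:MM:SS")
    saved = parse_hms(cpu_time_at_checkpoint)
    if saved is None:
        return current  # no checkpoint yet: everything would be redone
    return current - saved


# Hypothetical readings: one task with a checkpoint, one without.
print(cpu_seconds_at_risk("02:30:00", "02:15:00"))
print(cpu_seconds_at_risk("02:30:00", "-----"))
```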

Your experiment only clearly indicates that there are some cases where checkpoints are not being taken in the time you have run the task. It doesn't give you cause to generalize beyond that.

As I said, various types of tasks have different behaviors. In general, the objective is that tasks take checkpoints every 10 or 20 minutes. When new protocols are being developed, there is often a high degree of inconsistency amongst runtimes between models. Such newer protocols typically send out only small numbers of tasks until such issues are addressed.

Rosetta has several mechanisms in place to ensure that your machine does not endlessly cycle through such tasks and situations. Tasks that consume more than 4 hours of CPU beyond their target runtime preference are ended for you by the "watchdog". Tasks that restart at the same point more than 4 times are ended for you. 4 times may seem like a lot, but consider how many times you reboot while installing updates and applications on your machine. Often with little time to reach a checkpoint between reboots.
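The two rules can be written down as a small decision function. This is only a sketch of the rules as described in this post, not the actual watchdog code; the `TaskState` fields and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass

# Thresholds taken from the description above.
MAX_OVERRUN_HOURS = 4        # CPU hours allowed beyond the target runtime
MAX_RESTARTS_SAME_POINT = 4  # restarts allowed at the same point


@dataclass
class TaskState:
    cpu_hours: float             # CPU time consumed so far
    target_runtime_hours: float  # the user's target runtime preference
    restarts_at_same_point: int  # restarts without reaching a new checkpoint


def should_end_task(state):
    """Return a reason string if the task should be ended, else None."""
    if state.cpu_hours > state.target_runtime_hours + MAX_OVERRUN_HOURS:
        return "CPU time exceeded target runtime by more than 4 hours"
    if state.restarts_at_same_point > MAX_RESTARTS_SAME_POINT:
        return "restarted at the same point more than 4 times"
    return None


# Hypothetical example: a 3-hour target with 7.5 CPU hours used.
print(should_end_task(TaskState(7.5, 3.0, 0)))
```

Under these rules, a task with a 2-hour target would be ended once it passes 6 hours of CPU time.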
Rosetta Moderator: Mod.Sense
ID: 76086 · Rating: 0
Wissi

Joined: 19 Nov 08
Posts: 14
Credit: 485,807
RAC: 0
Message 76094 - Posted: 3 Oct 2013, 20:33:06 UTC

Okay, here is an example that shows a completely wrong estimated run time and still has no checkpoints:

ac_t20s_reg_shift_6.0A_1pma_fit_INPUT_B0413-B0415_01_SAVE_ALL_OUT_IGNORE_THE_REST_100388_220

40000 GFLOPs
processor time at last checkpoint: -----
processor time 01:11:09
current run time 01:10:24
estimated run time left: 01:43:38 (that's nonsense)
progress: 16,938% (that means it will run at least for another 6,5 hours)
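The reasoning behind that last remark is a linear extrapolation from the progress percentage. A minimal sketch, assuming progress grows proportionally with run time (which a real task may not do), with function names invented for this example:

```python
def remaining_seconds(elapsed_seconds, progress_percent):
    """Extrapolate the remaining run time, assuming progress is linear in time."""
    if not 0 < progress_percent <= 100:
        raise ValueError("progress_percent must be in (0, 100]")
    total = elapsed_seconds * 100.0 / progress_percent
    return total - elapsed_seconds


def fmt_hms(seconds):
    """Format a number of seconds as HH:MM:SS."""
    seconds = int(round(seconds))
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Figures from the task above: run time 01:10:24 at 16.938% progress.
elapsed = 1 * 3600 + 10 * 60 + 24
print(fmt_hms(remaining_seconds(elapsed, 16.938)))
```

For these figures the extrapolation gives roughly five and three-quarter hours remaining, which is the same order as the figure quoted in the post and far from the 01:43:38 shown by the client.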

My computer is not so bad: Core-I7-2600K@3,40GHz, 8 GB, ATI Radeon 5850.

So what's wrong with this task?
ID: 76094 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 76100 - Posted: 4 Oct 2013, 21:45:02 UTC

If I hand you a map, with a location marked on it, a location you have never been to before, in fact no one has ever been there, and I ask you how long it will take to get there, how do you respond? You probably look at the mileage and the terrain and come up with an approximate figure. But what if you encounter road construction, washed out bridges, or areas where no road exists? Your estimate could be off by enough that in hindsight it looks like "nonsense". If I then assert that you are the wrong person for the job because you can't even give a valid estimate, you'd tell me you did what you could with the information you had available.

Well, when the Rosetta program goes out studying a protein, each model is unique. No one has ever processed one exactly like it before. It's an estimate. For some tasks the estimates will be very close, for others... well... not so much. The BOINC Manager also confuses things because it attempts to learn how your machine's runtimes scale to the tasks from the project. Especially given your recent intermittent results, BOINC Manager's correction factors are rendered meaningless.

It looks like you have two machines attached to the project. One is going well with 8 CPUs and 8GB of memory, and the other is having problems, with 4 CPUs and 3GB of memory. Are your preferences for memory and for keeping tasks in memory consistent between the two machines? Is the typical number of hours per day the machine is on similar for the two?
Rosetta Moderator: Mod.Sense
ID: 76100 · Rating: 0
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1995
Credit: 9,633,537
RAC: 7,232
Message 76118 - Posted: 11 Oct 2013, 13:21:25 UTC - in response to Message 76094.  

ac_t20s_reg_shift_6.0A_1pma_fit_INPUT_B0413-B0415_01_SAVE_ALL_OUT_IGNORE_THE_REST_100388_220

So what's wrong with this task?


All ac_t20s tasks have problems with checkpoints and execution time (my default time is 2h, but these WUs run 6h)
ID: 76118 · Rating: 0
dcdc

Joined: 3 Nov 05
Posts: 1832
Credit: 119,667,480
RAC: 10,750
Message 76123 - Posted: 12 Oct 2013, 15:31:58 UTC - in response to Message 76118.  

ac_t20s_reg_shift_6.0A_1pma_fit_INPUT_B0413-B0415_01_SAVE_ALL_OUT_IGNORE_THE_REST_100388_220

So what's wrong with this task?


All ac_t20s tasks have problems with checkpoints and execution time (my default time is 2h, but these WUs run 6h)

It's not really a problem with the tasks (checkpoints would obviously be nice but might not be possible). 2h is a very short run time, so your computer probably isn't completing a single model in that time. Is it a problem for you?
ID: 76123 · Rating: 0
Wissi

Joined: 19 Nov 08
Posts: 14
Credit: 485,807
RAC: 0
Message 76126 - Posted: 14 Oct 2013, 10:03:59 UTC

The time my two computers are "on" per day of course differs. The smaller one is my office machine, and it's "on" for around 8h per day (but one hour, from 11:30am to 12:30pm, is taken away because during that time updates and checks are done if requested by our hardware department), whereas the bigger one is the one at home. As you might guess, during the week the home computer is "on" for a maximum of 4-5 hours per day. Only on weekends might this be longer, but it varies.

Here is the next "t20" task, which has no checkpoint yet, and I would bet right now that this will again be a task that can never be completed on my office hardware:

ac_t20s_reg_shift_6.0A_1pma_fit_INPUT_A0119-A0128_-2_SAVE_ALL_OUT_IGNORE_THE_REST_100271_976_0

Actual status right now:
processor time at last checkpoint: ----
processor time: 01:47:11
current run time: 01:50:47
estimated time left: 04:03:47
progress: 25,518%

Given the estimated time left, this will never complete on my office computer, at least as long as there are no checkpoints.

Maybe I need some help, but I did not find any way to tell the projects that I don't want tasks that will run for longer than, let's say, 2 hours on my office laptop. Any hints? OK, maybe an optician could help, but right now I didn't find it.

Regarding your comment about the map, and going where you have never been: nice try, but maybe you should ask your colleagues running EINSTEIN why their estimations are perfect in 80% of cases, and almost perfect for the rest??

But maybe, this is also an issue for the BOINC client, because if one second of run time is added to the value for "elapsed", this one second is simply subtracted from the value for "estimated rest". But there is no re-calculation of time done, based on the "progress" column.
ID: 76126 · Rating: 0
dcdc

Joined: 3 Nov 05
Posts: 1832
Credit: 119,667,480
RAC: 10,750
Message 76127 - Posted: 14 Oct 2013, 12:42:50 UTC - in response to Message 76126.  

Maybe I need some help, but I did not find any way to tell the projects that I don't want tasks that will run for longer than, let's say, 2 hours on my office laptop. Any hints? OK, maybe an optician could help, but right now I didn't find it.

Can you hibernate your computer rather than turning it off? I believe that will allow the task to restart (although I'm not 100% sure that's the case).


Regarding your comment about the map, and going where you have never been: nice try, but maybe you should ask your colleagues running EINSTEIN why their estimations are perfect in 80% of cases, and almost perfect for the rest??

It's a different project, so the analogy might not be valid for that project, but it is valid for Rosetta.
ID: 76127 · Rating: 0
Wissi

Joined: 19 Nov 08
Posts: 14
Credit: 485,807
RAC: 0
Message 76128 - Posted: 14 Oct 2013, 14:25:20 UTC - in response to Message 76127.  

Can you hibernate your computer rather than turning it off? I believe that will allow the task to restart (although I'm not 100% sure that's the case).


Well... it's against our company policies, but I'll give it a try anyway.

Actual state of the named task:

processor time at last checkpoint: ----
processor time: 03:47:32
current run time: 03:52:24
estimated time left: 02:19:51
progress: 54,160%

I will plug in the laptop at home today, then I will see immediately whether hibernating really helps. But in general, since other rosetta tasks do have checkpoints, I'm wondering why it shouldn't be possible to have them for the ac_t20s tasks.

ID: 76128 · Rating: 0
Wissi

Joined: 19 Nov 08
Posts: 14
Credit: 485,807
RAC: 0
Message 76129 - Posted: 14 Oct 2013, 21:01:39 UTC - in response to Message 76128.  

Okay, here is the result: yes, hibernating works. The task continues without restarting.

But now the funny thing: it ran for almost 8 hours (estimate: 4 hours). And in the messages, I suddenly read:

Task ac_t20s_reg_shift_6.0A_1pma_fit_INPUT_A0119-A0128_-2_SAVE_ALL_OUT_IGNORE_THE_REST_100271_976_0 exited with zero status but no 'finished' file. If this happens repeatedly, you may need to reset the project.

There is nothing in the messages window about uploading the result or anything else, so all the time was WASTED.

Hey, I DID a project reset some months ago, because this message is not new to me. And not all rosetta tasks show that message, so there must be something else going wrong with some of those tasks. I will look very closely at which tasks give the same results: long calculation time with no checkpoints, and then no 'finished' file.

So the next time I see a "t20s" task, I will consider stopping it immediately. Good night, over and out :-(
ID: 76129 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org