Message boards : Number crunching : Silly Newbie Tricks - Suspending a work unit
Author | Message |
---|---|
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
They make things "idiot proof" so that only an idiot can foul them up... I know the answer is obvious, but my real-world experience doesn't seem to match up. My laptop is up and running Rosetta now, and I'm able to let it run maybe 6-8 hours a day during the week, before powering down. I am running small work units (2 hours), so I'm not losing a great deal of crunching, but an hour here and an hour there add up. I tried three different methods, and all failed to allow me to continue with a work unit when powering back up. Rather, the work unit starts from scratch. (1) Using the "Suspend" button from the "Tasks" tab. (2) Using the "Suspend" button from the "Projects" tab. (3) Just shutting down Win-doze from the "Start" button. I "assume" any of these three should have allowed me to "Resume". How often are checkpoints created? If every hour, then I guess it's possible that it would have to start from the beginning. I know this is an eye-dee-ten-tea ("ID10T") enduser-error, so any directions to the path of enlightenment will be sincerely appreciated! |
Alan Roberts Send message Joined: 7 Jun 06 Posts: 61 Credit: 6,901,926 RAC: 0 |
Here's what I do with my laptop:
|
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
If it's restarting the WUs, it may not have reached a checkpoint. Yes, upon completion of a model, a checkpoint is made. And checkpoints may be made within a model as well. It varies by work unit as to where checkpoints in mid-model are possible. Some take more than an hour to reach a point in the calculations where it is possible to take a checkpoint. So, on such a protein, if you were only 50 minutes into the first model and turned off your machine, when you restart, you will have to start from the beginning. This should be fairly rare if your machine is on for the 6-8 hrs at a time you describe. My understanding is that you will see a bump in the % complete when a mid-model checkpoint has been made. These checkpoints are the fractional portion of the % complete. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
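A minimal sketch of the restart behaviour described above, assuming a task always resumes from its most recent checkpoint and that a task with no checkpoint yet starts over. The function names and the checkpoint times are hypothetical, chosen only to mirror the "50 minutes in" example; they are not taken from Rosetta or BOINC.

```python
from bisect import bisect_right


def resume_point(checkpoints_min, shutdown_min):
    """Return the elapsed minute the task resumes from after a shutdown.

    checkpoints_min: sorted minutes at which checkpoints were written.
    shutdown_min: minute at which the machine was powered off.
    """
    # Number of checkpoints written at or before the shutdown time.
    idx = bisect_right(checkpoints_min, shutdown_min)
    # No checkpoint reached yet means starting over from minute 0.
    return checkpoints_min[idx - 1] if idx > 0 else 0


def work_lost(checkpoints_min, shutdown_min):
    """Minutes of crunching that must be redone after the restart."""
    return shutdown_min - resume_point(checkpoints_min, shutdown_min)


# Hypothetical work unit whose first checkpoint only comes after 65 minutes.
checkpoints = [65, 110, 120]
print(work_lost(checkpoints, 50))   # shut down before any checkpoint
print(work_lost(checkpoints, 100))  # shut down between checkpoints
```

In this toy case, shutting down at minute 50 loses all 50 minutes, while shutting down at minute 100 loses only the 35 minutes since the checkpoint at minute 65.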
Keith Akins Send message Joined: 22 Oct 05 Posts: 176 Credit: 71,779 RAC: 0 |
With BOINC 4.x I could suspend the project then exit BOINC and a checkpoint would be forced so that on restart the current model would begin where it left off. On BOINC 5.4.9 I've noticed that this doesn't always work. Seems to be a flakey bug in it. |
Scott14o Send message Joined: 7 Apr 06 Posts: 24 Credit: 2,147,598 RAC: 0 |
I, too, find it annoying that the checkpoints are only after every model. My computer isn't the fastest, so it sometimes takes a while on each model. It would be nice to know that I can shut down my computer for the night and that the hour and a half of work it had already done wasn't wasted. Are there plans to allow it to have more checkpoints? |
Avi Send message Joined: 2 Aug 06 Posts: 58 Credit: 95,619 RAC: 0 |
Keith Akins wrote: "With BOINC 4.x I could suspend the project then exit BOINC and a checkpoint would be forced so that on restart the current model would begin where it left off. On BOINC 5.4.9 I've noticed that this doesn't always work. Seems to be a flakey bug in it." I recall reading that at certain points, there are above 300 MB of data that need to be stored. When I shut my laptop, I usually hibernate. Then I have no fear of missing out on a checkpoint in Rosetta, AND the laptop starts up much faster afterwards. |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Keith wrote: "With BOINC 4.x I could suspend the project then exit BOINC and a checkpoint would be forced so that on restart the current model would begin where it left off. On BOINC 5.4.9 I've noticed that this doesn't always work. Seems to be a flakey bug in it." I don't think you are correct on that. BOINC has no way to force applications to perform a checkpoint. Checkpointing (and the lack thereof) has been a problem for many BOINC projects. Scott wrote: "Are there plans to allow it to have more checkpoints?" Rosetta did add the mid-model checkpoints. And the team seems aware that additional checkpoints are desirable if possible, especially for the larger proteins, which take longer for each model and longer to reach the mid-model checkpoints. They end up in a catch-22 situation where if they checkpoint too frequently, they are consuming your machine resources in performing the checkpoints, rather than doing the science. If they don't checkpoint frequently enough, they end up losing sometimes significant amounts of the science work that has been done. It's a fine line to walk. The good news is that with the "watchdog" they've struck a balance that introduces a failsafe mechanism that ends the WU for you if the combination of the type of WU and the relative speed of the machine or time it is taking to reach a checkpoint is causing a specific WU not to make progress. This sort of puts a cap on environments and combinations that are losing significant crunch time and not reaching checkpoints where the work is preserved. |
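The catch-22 above can be illustrated with a toy model. This is only a sketch under stated assumptions: checkpoints are written every `interval` minutes of computing, each one costs `cost` minutes, the machine is switched off once per `session` minutes, and on average half a checkpoint cycle of wall time is lost at each shutdown. The function name and all numbers are hypothetical and say nothing about Rosetta's real checkpoint costs.

```python
def useful_fraction(interval, cost, session):
    """Fraction of a session that ends up as preserved science work.

    interval: minutes of computing between checkpoints (assumed fixed).
    cost: minutes spent writing one checkpoint (assumed fixed).
    session: minutes the machine stays on before it is powered off.
    """
    cycle = interval + cost
    if cycle > session:
        # The first checkpoint is never reached, so nothing is preserved.
        return 0.0
    # Assumption: the shutdown lands, on average, halfway through a cycle.
    wall_lost = cycle / 2
    # Only the computing share of the remaining wall time is science.
    return (session - wall_lost) * (interval / cycle) / session


# Hypothetical 7-hour session with a 1-minute checkpoint cost.
for interval in (1, 10, 60, 240):
    print(interval, round(useful_fraction(interval, 1, 420), 3))
```

With these made-up numbers, very frequent checkpoints waste time on checkpointing and very rare ones waste time on lost work, so the best value sits somewhere in between.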
John McLeod VII Send message Joined: 17 Sep 05 Posts: 108 Credit: 195,137 RAC: 0 |
Keith Akins wrote: "With BOINC 4.x I could suspend the project then exit BOINC and a checkpoint would be forced so that on restart the current model would begin where it left off. On BOINC 5.4.9 I've noticed that this doesn't always work. Seems to be a flakey bug in it." Feet1st wrote: "I don't think you are correct on that. BOINC has no way to force applications to perform a checkpoint. Checkpointing (and the lack thereof) has been a problem for many BOINC projects." This is correct. Whenever a project application wishes to checkpoint, it asks the BOINC client if it is time yet. If it is time for a checkpoint, the project checkpoints, and if it is not, the project is not supposed to checkpoint. There are a couple of projects that ignore this: CPDN checkpoints once every 5 to 60 minutes, ignoring the checkpoint timer, and it is common for the first cut of checkpointing in Alpha level projects to miss this detail. The most recent one of these checkpointed about 5 times per second on a fast machine. BOINC WIKI |
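The "ask the client first" handshake can be sketched as a small simulation. The class and function names below are hypothetical stand-ins; as far as I know, the real BOINC C API exposes the same idea through `boinc_time_to_checkpoint()` and `boinc_checkpoint_completed()`, but nothing here calls that API. The 60-second interval and the 10-second work step are assumptions for illustration.

```python
class ClientTimer:
    """Hypothetical stand-in for the BOINC client's checkpoint timer."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self.last_checkpoint_s = 0

    def time_to_checkpoint(self, now_s):
        # True once enough time has passed since the last checkpoint.
        return now_s - self.last_checkpoint_s >= self.min_interval_s

    def checkpoint_completed(self, now_s):
        # The application reports that it has finished writing its state.
        self.last_checkpoint_s = now_s


def run(client, step_s, n_steps, respect_timer=True):
    """Return the times (seconds) at which the application checkpointed."""
    written = []
    now = 0
    for _ in range(n_steps):
        now += step_s  # one unit of science work
        # A well-behaved app checkpoints only when the client says so;
        # a misbehaving one (respect_timer=False) checkpoints every step.
        if not respect_timer or client.time_to_checkpoint(now):
            written.append(now)
            client.checkpoint_completed(now)
    return written


polite = run(ClientTimer(60), step_s=10, n_steps=30)
greedy = run(ClientTimer(60), step_s=10, n_steps=30, respect_timer=False)
print(len(polite), len(greedy))
```

Over the same 300 simulated seconds, the application that respects the timer checkpoints only a handful of times, while the one that ignores it checkpoints after every step, which is the failure mode described above.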
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi John. You might be able to tell whether what I'm seeing is a Rosetta or a BOINC problem. I'm running BOINC alpha 5.5.13 because of the problems with the two SETIs. I have just joined Rosetta and I have my apps switching every 2 hrs. SETI preempts OK, but Rosetta kept going until it finished the first couple of WUs. Today it got new work and it preempted at about 2.5 hrs, and there is nothing in the messages about it, only that SETI has started. Any ideas? |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Hey Peter, welcome to Rosetta! I've heard something about a feature coming soon in BOINC where it would preempt at checkpoints. Perhaps that's in your alpha version? So, after 2hrs of crunching Rosetta, it tapped that Rosetta WU on the shoulder and said "hey we'd like you to pack up for a bit" and it took the WU another half hour to reach a checkpoint, at which time BOINC rescheduled the CPUs. Does that sound like what happened? |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi Feet1st. Possible, but I just came home and now SETI has kept running; it's over 2.5 hrs and still going, I guess until it finishes! I will see what happens; I might have to go back to 5.4.9. |
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
Hi Feet1st. Finally! I think this is a very good feature. I was tired of seeing BOINC reschedule when the old WU was almost done. I think it is much better to finish the current WU when it's near completion and then reschedule, rather than reschedule every 2 hrs no matter whether there was a recent checkpoint or the WU was almost done. Don't you agree? |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Yes, let the rest of the debt system etc. work out the details down the road. My understanding is it waits until a checkpoint is reached. So, may not be a completed model or completed WU... but means that no work is lost, even if you aren't keeping in memory or turn off the machine! Simple way to extract maybe 5% more useful work out of the existing machines. It depends on how often you end BOINC or were losing work that hadn't been checkpointed. If anyone has a link to the details of this upcoming BOINC feature, please post a link. |
John McLeod VII Send message Joined: 17 Sep 05 Posts: 108 Credit: 195,137 RAC: 0 |
Feet1st wrote: "Yes, let the rest of the debt system etc. work out the details down the road. My understanding is it waits until a checkpoint is reached. So, may not be a completed model or completed WU... but means that no work is lost, even if you aren't keeping in memory or turn off the machine!" The 5.5 CPU scheduler waits for the next checkpoint later than 10 seconds before the check (there is some asynchronous code, and several seconds can disappear if the host is slow and busy) unless there is a task that needs extra CPU time to complete on time. This may suspend a task just a few seconds before it is complete if there is a checkpoint there, but normally a checkpoint will only happen once every few minutes. Problems that had to be dealt with: tasks that run for days without checkpointing (there are projects that do this), and projects that lie about how much work is left (one project I remember had tasks that had 100 hours or so of CPU time after 100% complete was reached on some tasks). 5.5.13 also implements work fetch that does not fetch a full queue from each project, and keeps the queue full even if there is a risk of late work. The user has indicated that the CPU would probably be idle if there was not enough work to keep it busy. |
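The preemption rule described above can be sketched roughly as follows. This is an interpretation, not the BOINC scheduler's code: the function name, the `deadline_pressure` flag, and the choice to switch immediately when another task is at risk of missing its deadline are assumptions. The post also does not say how tasks that never checkpoint are handled, so the sketch simply returns `None` for that case.

```python
def switch_time(slice_end, checkpoints, deadline_pressure=False, grace=10):
    """Return the second at which the running task gets preempted.

    slice_end: time of the scheduler's periodic check, in seconds.
    checkpoints: times at which the task writes checkpoints, in seconds.
    deadline_pressure: True if another task needs the CPU to finish on time.
    grace: checkpoints up to this many seconds before the check still count.
    """
    if deadline_pressure:
        # Assumption: do not wait for a checkpoint in this case.
        return slice_end
    for t in sorted(checkpoints):
        # First checkpoint later than `grace` seconds before the check.
        if t > slice_end - grace:
            return t
    return None  # no qualifying checkpoint; behaviour not covered here


# A 2-hour slice where the next checkpoint arrives half an hour late.
print(switch_time(7200, [3600, 9000]))
# A checkpoint written 5 seconds before the check is recent enough.
print(switch_time(7200, [7195, 9000]))
```

The first call reproduces the situation reported earlier in the thread, where a task that was due to be switched out at 2 hours actually stopped at about 2.5 hours.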
Jean-David Beyer Send message Joined: 2 Nov 05 Posts: 188 Credit: 6,465,996 RAC: 6,049 |
Is rosetta@home one of these? This morning, after about 5 hours, the boincmgr indicated that rosetta@home reached 100% complete, yet it has been running about 10 hours since then. And really running, not stalled. I am running 5.8.16 of the BOINC client and boincmgr. rosetta_5.69_i686-pc-linux-gnu is the program itself. This is a Red Hat Enterprise Linux 5 system with two 3.06 GHz hyperthreaded Xeon processors and 8 GBytes RAM.
$ ps -fu boinc
UID PID PPID C STIME TTY TIME CMD
boinc 2420 4627 86 03:52 ? 15:04:04 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 2421 2420 0 03:52 ? 00:00:00 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 2422 2421 0 03:52 ? 00:00:00 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 2423 2421 0 03:52 ? 00:00:00 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 4627 4625 0 Sep29 ? 00:11:16 /home/boinc/BOINC/boinc |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I assume you are asking about the comment I've bolded? ...not to my knowledge. I believe the odd symptoms people are seeing on Linux all relate to tasks which show they are not yet completed, but BOINC has requested that they stop crunching and it has scheduled another task, but the Rosetta thread continues working... working as it otherwise normally would. As in, it will finish at a normal time... just that it shouldn't still be running. Rosetta Moderator: Mod.Sense |
Jean-David Beyer Send message Joined: 2 Nov 05 Posts: 188 Credit: 6,465,996 RAC: 6,049 |
You assume correctly. Most rosetta work units seem to complete in 5 to 8 hours for me. This one announced it was 100% complete and had no time remaining at about 5 hours, but it has now run up 22 hours 17 minutes. According to "top" command, it has consumed 1338:07 (minutes:seconds) time. If I knew it was running something important, I would just let it run, but most of this time has run up after boincmgr announced the process was complete. Also I do not understand the excess rosetta processes.
PID PPID USER PR NI S VIRT RES SHR SWAP %MEM %CPU TIME+ P COMMAND
2420 4627 boinc 39 19 R 56500 45m 20 9632 0.6 74 1342:07 0 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 1
4629 4627 boinc 34 19 S 35760 5900 3148 29m 0.1 0 1:07.95 0 hadcm3trans_5.41_i686-pc-linux-gnu hadcm3inct_cmus_1920_160_65869824 1085_ocean.year yafbg
2421 2420 boinc 34 19 S 56500 45m 20 9632 0.6 0 0:00.13 2 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 1
2422 2421 boinc 34 19 S 56500 45m 20 9632 0.6 0 0:00.51 1 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 1
2423 2421 boinc 35 19 S 56500 45m 20 9632 0.6 0 0:00.04 2 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 1 |
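One way to make sense of the "excess" processes is to read the PID and PPID columns as a tree. The sketch below does this for the PID/PPID pairs copied from the listing in the post; the helper names are hypothetical, and the code only reorganises the numbers already shown, it does not explain why the extra processes exist.

```python
from collections import defaultdict

# (PID, PPID, command) triples copied from the listing in the post.
PROCS = [
    (2420, 4627, "rosetta_beta_5.80_i686-pc-linux-gnu"),
    (4629, 4627, "hadcm3trans_5.41_i686-pc-linux-gnu"),
    (2421, 2420, "rosetta_beta_5.80_i686-pc-linux-gnu"),
    (2422, 2421, "rosetta_beta_5.80_i686-pc-linux-gnu"),
    (2423, 2421, "rosetta_beta_5.80_i686-pc-linux-gnu"),
]


def children_map(procs):
    """Map each parent PID to the list of its direct child PIDs."""
    kids = defaultdict(list)
    for pid, ppid, _cmd in procs:
        kids[ppid].append(pid)
    return kids


def descendants(kids, root):
    """Return every PID below `root` in the process tree, sorted."""
    found = []
    stack = list(kids.get(root, []))
    while stack:
        pid = stack.pop()
        found.append(pid)
        stack.extend(kids.get(pid, []))
    return sorted(found)


kids = children_map(PROCS)
print(descendants(kids, 2420))  # processes hanging off the busy rosetta PID
print(descendants(kids, 4627))  # everything started under PPID 4627
```

Read this way, the three idle rosetta processes all sit below PID 2420, the one accumulating CPU time, and 2420 itself was started by PID 4627.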
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I guess I would suggest ending BOINC and restarting. The "excess" processes could be due to BOINC going to a "waiting for memory" state. It then starts up another process and crunches on that until memory again crosses above your preference. I see you have 4 cores and 8 GB of memory. Do your BOINC General Preferences allow it to use at least 25% of that? For both idle and while active? |
Jean-David Beyer Send message Joined: 2 Nov 05 Posts: 188 Credit: 6,465,996 RAC: 6,049 |
Mod.Sense wrote: "I guess I would suggest ending BOINC and restarting." I do not see why my machine would have any trouble getting memory for a BOINC application. I have 8 GBytes RAM and allow 75% of it to BOINC when the machine is busy (whatever that means) and 95% when the machine is not busy. Typically, 75% of the RAM is devoted to the input cache, although that can go down somewhat when I run a PostgreSQL database application. I tried stopping BOINC and everything stopped except for the rosetta programs that kept running. The one with all the time on it was the parent of the other three. I killed them and restarted BOINC and all seems to be running normally. I assume I lost 30 hours credit for that mess. |
Jean-David Beyer Send message Joined: 2 Nov 05 Posts: 188 Credit: 6,465,996 RAC: 6,049 |
Mod.Sense wrote: "I guess I would suggest ending BOINC and restarting." Progress report, sort-of. I probably did not lose any credit, at least as yet. After the boinc client scheduler got around to it, it resumed that 100% progress work unit again and it ran quite a few hours more. Then it started another part of the same work unit (same line in boincmgr), reset the time run to 0, but still indicating 100% progress with no time remaining. Since then it has run up more than 37 hours. I propose to let it run another day or so and see what happens. |
©2024 University of Washington
https://www.bakerlab.org