Message boards : Number crunching : Tasks end prematurely.
Author | Message |
---|---|
l_mckeon Joined: 5 Jun 07 Posts: 44 Credit: 180,717 RAC: 0 |
I've had a number of tasks recently that will end prematurely if the computer has been switched off, or if the task or Rosetta has been suspended. The task(s) resume, then end and upload within 15 seconds or so. In some cases I've deliberately waited until after a checkpoint before switching off; no difference. The tasks are well short of predicted run times. |
Chris Joined: 11 Dec 07 Posts: 4 Credit: 12,573,597 RAC: 0 |
I have seen the same thing |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Please post links to the tasks in question. Some tasks end before your runtime preference. Some can end after. That would be normal. Is it possible that interrupting the task was coincidence? Did you happen to see the graphic display prior to interrupting it? Was it proceeding normally? Rosetta Moderator: Mod.Sense |
Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
I encountered this situation myself. I observed a task checkpoint and the Rosetta application continued processing rather than terminate and report. That signified to me that the application determined there was sufficient time left to try and create another model. However, I proceeded to suspend the task and exit BOINC (reason unimportant). When I restarted the task it immediately terminated the task and reported. I concluded that upon restart Rosetta didn't have all the same information that it had before suspension, so it concluded it should not attempt another model. Perhaps though there is a bug. |
Koen Joined: 29 Sep 05 Posts: 8 Credit: 8,542,574 RAC: 0 |
It happens when the task has run for more than half your target CPU time. On restart the task ends and reports only 1 decoy, as seen in the task details (Example), but receives what seems to be the right amount of credit. My guess: at restart, Rosetta Mini (this only occurs on Mini tasks) reports the exact runtime to BOINC but only 1 decoy. If this runtime is past the halfway point of your target time, BOINC decides there isn't enough time to complete another decoy. If needed I can try to describe this a little better; my English is a bit rusty... Koen |
l_mckeon Joined: 5 Jun 07 Posts: 44 Credit: 180,717 RAC: 0 |
Quoting Mod.Sense: "Please post links to the tasks in question. Some tasks end before your runtime preference. Some can end after. That would be normal. Is it possible that interrupting the task was coincidence? Did you happen to see the graphic display prior to interrupting it? Was it proceeding normally?" Just looking at some of my past results, these look like two examples: Task ID 186604373, name abinitio_only62_A_1ew4A_4434_7211_0, workunit 170438591; and Task ID 186604330, name abinitio_homfrag_71_A_1gu3A_4443_2760_0, workunit 170438511. Everything was proceeding normally. This is a fairly new behaviour. |
Jipsu Joined: 27 Jan 08 Posts: 10 Credit: 454,555 RAC: 0 |
I have noticed this too. Mini is running normally and the graphics show that it has produced some decoys; then, if I restart my computer, it reports the task as completed with only 1 decoy, or with a validate error. Strange. I just shouldn't reboot the computer, I guess. :D This is the latest task: Validate error. |
l_mckeon Joined: 5 Jun 07 Posts: 44 Credit: 180,717 RAC: 0 |
Quoting Mod.Sense: "Please post links to the tasks in question. Some tasks end before your runtime preference. Some can end after. That would be normal. Is it possible that interrupting the task was coincidence? Did you happen to see the graphic display prior to interrupting it? Was it proceeding normally?" ... and another one: 26/08/2008 3:44:31 PM\|rosetta@home\|Restarting task abinitio_only62_A_256bA_4438_6378_0 using minirosetta version 132; 26/08/2008 3:45:01 PM\|rosetta@home\|Computation for task abinitio_only62_A_256bA_4438_6378_0 finished after 2 hours and 7 minutes of run time, when 2 hours 40 minutes was predicted. Lindsay |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Quoting Mod.Sense: "Please post links to the tasks in question. Some tasks end before your runtime preference. Some can end after. That would be normal. Is it possible that interrupting the task was coincidence? Did you happen to see the graphic display prior to interrupting it? Was it proceeding normally?" What are your actual run time settings, either in BOINC Manager or on the Rosetta account page? 4 hours, or what? Predicted run times are not always accurate. Actual run times can range from well under the preferred run time in your settings to within 30 minutes or less of it, if the program decides it cannot run another model/decoy in the time you gave it to run. |
DaBrat and DaBear Joined: 9 Aug 08 Posts: 16 Credit: 213,180 RAC: 0 |
Not sure if this is an issue or not. I have watched several tasks get exactly down to the 10 min mark and then terminate and report. Mini as well. The tasks always start with a long completion time when I attach my comp. Even if the completion time was 6 hours to begin with, the task will sometimes hit 2:40 or less and report... not sure what that means. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Quoting DaBrat and DaBear: "Not sure if this is an issue or not. I have watched several tasks get exactly down to the 10 min mark and then terminate and report. Mini as well. The tasks always start with a long completion time when I attach my comp. Even if the completion time was 6 hours to begin with, the task will sometimes hit 2:40 or less and report... not sure what that means." Look at your task result reports for those tasks; the stderr text will tell you if there are any errors. Your credit will also tell you if there were problems. The most common reason for BOINC to terminate the job is that it decides it cannot complete the next decoy in time to meet your run time preference. Other causes are errors in that task's code. As for stopping at 10 minutes to completion, the run times are estimates. I don't know how to explain how BOINC gets its time to completion updated, but I think it has to do with how many decoys were completed and how many more it estimates it can complete before running out of time. If you say run for 6 hrs and the program runs 5 hrs and 50 mins and then stops, that is within the run time you stated. Rosetta completed the last model at that point, and with 10 minutes to your deadline it cannot do any more work. Someone else can refine my statements to be more precise. |
DaBrat and DaBear Joined: 9 Aug 08 Posts: 16 Credit: 213,180 RAC: 0 |
Nah, usually Rosie downloads tasks to my computer when I first attach with ridiculously long completion times, such as 6.5 hours... after a while they even out to about 2:50 a WU. It will show them all at this completion time or less, but no matter what the estimated completion time is, the task ends and exits prematurely. Credits are great, usually more than I request when reported, and no errors showing. |
l_mckeon Joined: 5 Jun 07 Posts: 44 Credit: 180,717 RAC: 0 |
In my prefs. Target CPU run time is "Not Selected". My tasks typically run between 2:30 and 3:00 hours. Both Rosetta and I have been doing this long enough to know what to expect. I believe this is NEW behaviour. |
DaBrat and DaBear Joined: 9 Aug 08 Posts: 16 Credit: 213,180 RAC: 0 |
LOL, neither is mine, and this has nothing to do with that selection. If it is not selected it defaults to three hours, even though the first batch that loaded to my comp said 6 hours for some odd reason. Whenever a new task is downloaded to your comp, it has an estimated time of completion, I assume based on your CPU benchmarks. I believe the previous poster's situation is the same as mine. At 10 mins before completion, even though the task may not read 99% complete, it simply completes itself and uploads. Usually on mine it reads about 95% complete. No errors, no nothing. As a matter of fact I usually wind up with more credit granted than claimed, and this behaviour seems to be specific to the minis, at least in my case. The task can have an estimated completion time of 2:50 but will end at 2:18 - 2:44 with less than 98% showing. Maybe it is BOINC reading that it doesn't have time for another decoy in the time remaining. |
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
This thread has gotten a bit confusing, if not for the posters here then for at least one lurker. I thought I would try summarizing both Rosie's and Boinc's expected behavior. Hopefully Mod.Sense, a project admin, and/or another cruncher will clarify anything I get muddled. An example ... Rosetta sends tasks with estimated runtime of 3 hours (the default), completes many close to but always under 3 hours (respecting the preferred, in this case default, runtime) occasionally significantly under (large protein/complicated task=1 model taking over half the preferred/default runtime) resulting in BOINC estimated "to completion" time of 2:50. Most tasks will finish in less time but any reaching this time will show approximately 10 minutes to go and 95% progress until it finishes regardless of how long that actually takes. Now for the somewhat long explanation. The very first time you download a task the "to completion" time showing in the BOINC manager is the same as the estimated time sent by the project application. Your computer may take more or less time to complete the workunit than the estimate given by the app. For future tasks BOINC has a formula for calculating the "to completion" time which takes into account the estimate received from the project and what BOINC has learned about how efficiently your particular computer runs this particular project. What BOINC has learned is reflected in your duration correction factor or DCF which you can see at the bottom of your project account page. Generally speaking, after you have run several tasks BOINC will be able to estimate the time to completion quite accurately. Unfortunately Rosetta has a few unique characteristics which prevent BOINC from ever settling into a stable DCF. Unlike other projects Rosetta allows the volunteer to set their own preferred runtime. If you don't change this manually I believe the current default is 3 hours. 
Whatever it is, this mechanism means that every task you will receive, no matter how large a protein or how complicated a set of calculations (in other words, no matter how long it takes to run a single model), will come to you with the same estimated runtime. Rosetta tries to respect this runtime preference by calling the app to finish if it thinks you cannot complete another model before reaching your preferred (or the default) runtime. BOINC doesn't know this however, and interprets the early end as an indication that your computer runs Rosetta more efficiently than Rosetta's estimate would indicate. It adjusts your DCF and the "to completion" time of any subsequent workunits downward. Downward adjustments are always quite small (it's not a second for second change, a percentage perhaps?) but generally speaking most of us will see a "to completion" time somewhat less than the preferred or default runtime. Most of the time. We won't when the other unique characteristic of Rosetta comes into play. Rosetta wants to make sure they get some useful data back from these workunits and so they have also told the app not to quit until it completes at least one model even if that takes up to 4x longer than the preferred or default runtime. A particularly large protein or perhaps a complicated docking task which takes longer than the estimated runtime will result in an increase in your DCF and subsequent "to completion" times. It also causes the workunit to appear stuck at 10 minutes to completion/95% progress. On some projects the progress meter seems to reflect stages completed within the application (and/or checkpoints) but I believe on Rosetta it is a reflection only of time spent compared to the preferred or default runtime. Once the task goes beyond the estimated time BOINC has no way of knowing how much longer it will take and thus appears stuck at approximately the 10 minutes/95% mark. 
I must acknowledge that none of this answers the original poster's question about possible premature exits. It simply shows that some of the evidence given is not sufficient to prove such an event has happened and is in fact expected behavior. Both l_mckeon and Nothing But Idle Time have mentioned checkpoints. I don't think you can identify Rosetta checkpoints in the BOINC manager but I seem to recall someone identifying the files where they are located. Is this right? How much time was there between seeing the checkpoint and shutting down? Did you look at the graphics to see if it in fact had started a new model after that checkpoint? Maybe it was doing the calculation to determine if it could complete another model when you shut it down? Could the deadline have come into play? Okay, that feels like quite a long shot but I am curious as to exactly how the decision to quit or continue is made. Even within a task models can take different amounts of time to complete, so after the second does the app look at the average or the longest or something else? Does the deadline come into play? Snags Spotted this on Mike Hewson's posts (moderator on Einstein): "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal. My apologies for being so long-winded. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Thanks snags! I was wondering when someone would notice there are at least 2 or 3 different issues being discussed now. I would only revise your example to point out that tasks can and do run over the preferred runtime, especially when the preference is under 4 hours. But it is more common for tasks to end early than it is for them to end late. This is because most tasks are running models that take less than an hour to complete. Every once in a while, you will see a task that takes 4-6 hours to complete a single model. These will generally show an estimated time to completion of 10 minutes, give or take a minute, during any runtime past the preference. You can identify checkpoints on the client if you enable checkpoint debugging in your cc_config.xml by setting <checkpoint_debug>1</checkpoint_debug> in the <log_flags> section (info in BOINC wiki). Rosetta Moderator: Mod.Sense |
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
Argh! Thanks for the correction Mod.Sense. That "always under 3 hours" should be "almost always under 3 hours". I really should have caught that as my older ppc mac probably runs over its preferred runtime more than most. (Yep, just checked, my DCF is 1.18.) I was thinking about this thread as I took a walk early this morning and by the time I got to the keyboard I was pretty focused on explaining why the "to completion" time doesn't match the estimate given by the project. On most if not all other projects this estimate is invisible to the cruncher and its appearance/use here on Rosetta as the "target CPU run time" seems to lead to a fair amount of confusion. If you know about BOINC but not Rosetta or vice versa you could easily think something was wrong when in fact everything was operating just as it should. WCG has some projects that run similar types of tasks in that they run as many models as they can within a set period of time. WCG has chosen 8 as the magic number and so far I have never seen one of those tasks run less than eight hours or more than a minute or so over eight hours. So either the runtime for each model is very very very short or the app calls for a finish at the eight hour mark regardless of the progress of the last model. I couldn't swear to it but I would bet on the latter explanation. If BOINC thinks the task will take eight hours and it takes eight hours the DCF will remain 1 and the "to completion" time (showing in the BOINC manager) for the next task will match the estimate (invisible to the cruncher) given to BOINC by the app. The "CPU time" and "to completion" columns will always (er, maybe I should say almost always!) add up to eight and the percent in the "progress" column will (should) always make sense. But the trade-off is the wasted cpu time spent working on that last model before being forced to abandon it by a strict task time limit. 
Given the many different types of tasks here on Rosetta, and the considerable variation in runtime per model I would guess that a strict task time limit here would result in a lot of wasted cpu time. Personally I prefer coping with the variation. Thanks for the info on checkpoints. I've always been scared of poking about the BOINC files but I think I'd like to try this. Give me a week or so, to search through several forums for everything that could go wrong and then to build up my courage to try it anyway:) Snags |
l_mckeon Joined: 5 Jun 07 Posts: 44 Credit: 180,717 RAC: 0 |
When BOINC tasks save a checkpoint they have to put the data somewhere, and I believe it's in the SLOTS sub-directories/folders. They are called Slots/0, Slots/1, etc., depending on the number of current tasks and projects in memory. You can see it clearly on the World Community Grid project on rice; it updates every couple of minutes and so does the corresponding slots directory. These directories are deleted when all tasks are finished and uploaded. I'll get a new batch of Rosetta tasks and watch things more closely. |
DaBrat and DaBear Joined: 9 Aug 08 Posts: 16 Credit: 213,180 RAC: 0 |
Quoting Snags: "Argh! Thanks for the correction Mod.Sense. That 'always under 3 hours' should be 'almost always under 3 hours'. I really should have caught that as my older ppc mac probably runs over its preferred runtime more than most. (Yep, just checked, my DCF is 1.18.) I was thinking about this thread as I took a walk early this morning and by the time I got to the keyboard I was pretty focused on explaining why the 'to completion' time doesn't match the estimate given by the project. On most if not all other projects this estimate is invisible to the cruncher and its appearance/use here on Rosetta as the 'target CPU run time' seems to lead to a fair amount of confusion. If you know about BOINC but not Rosetta or vice versa you could easily think something was wrong when in fact everything was operating just as it should." I responded to your somewhat long-winded and eloquent post here: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=4213 |
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
From DaBrat and DaBear in the other thread, "So I guess the best option would be to change my time preferences to 6 hours... that way they will all download with a 6 hour completion time amd if they complete in two I'll get a new WU somewhere in that window..." A WU with a 6 hour preferred runtime completing in two hours would actually be an indication of a premature exit. The very first wu would arrive with an expected runtime of six hours and BOINC would indicate this time in the "To completion" column before the task begins. Rosetta completes its first model and takes note of the time. If it took less than 3 hours to complete that model it will begin a second and will continue in this fashion until it thinks another model will overrun the 6 hour preference. If it took longer than three hours to complete one model it will end immediately. Rosetta will not check the time until it finishes that first model or until 4x the preferred runtime has elapsed. So the workunit with a 6 hour preferred runtime will in fact run anywhere from three hours to 24 hours. Whatever the number, BOINC will take it, apply its semi-secret formula and adjust the "to completion" times showing for every Rosetta unit in your queue. If you sent that first wu back after only 3 hours and 5 minutes then the time in the "to completion" column would now be reduced to slightly under 6 hours for all the remaining and subsequently acquired wus. If the first wu was sent back after 24 hours then BOINC would adjust the "to completion" time of all the remaining wus dramatically upward to 24 or very close to 24 hours. The important thing is those wus don't care about BOINC's estimated time to completion, they probably don't even know about it. They are going to do exactly what the first wu did and check the time after the first model. 
If it's under three hours they continue (provided you haven't gone to the website and changed your preferred runtime), if it's over three hours they quit regardless of the number BOINC puts in the "to completion" column. Whatever the number is, BOINC again does its thing, applying its formula to the time it took the second wu to run and adjusting the "to completion" times of the remaining onboard wus. And on and on. It is possible that you downloaded a bunch of identical tasks which will indeed all run over the preference and possibly lead to deadline trouble. Much more likely however you have a mix of tasks on board most of which will finish within the six hour preference. BOINC doesn't know this though so it goes into panic mode and runs the next wu on high priority and maybe the next one after that. Eventually the actual completion times it gets from Rosetta lead it to adjust its own estimates sufficiently downward to let it out of panic mode. This panic may never happen to most people and is most likely to happen if you keep a large cache. It can also be caused if you adjust your preferred runtime upward in large chunks. Once again I have written an exceedingly long post. If it hasn't cleared things up for you, DaBrat and DaBear, I will comfort myself with the fantasy that it has triggered an aha! moment for a lurker:) Snags |
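
The stop-or-continue behaviour Snags describes in the last post can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Rosetta's actual code: the function names are hypothetical, and the use of the average time of completed models as the estimate for the next one is a guess (the thread itself leaves open whether the app uses the average, the longest, or something else). The 4x cutoff follows the description given in the thread.

```python
def should_start_another_model(elapsed_hours, model_hours, target_hours):
    """Return True if one more model is expected to fit in the target runtime.

    Assumption: the next model is estimated from the average duration of the
    models completed so far. The real application may use a different rule.
    """
    if not model_hours:
        # At least one model is always attempted.
        return True
    estimate = sum(model_hours) / len(model_hours)
    return elapsed_hours + estimate <= target_hours


def simulate_task(model_hours, target_hours, cutoff_factor=4.0):
    """Run models in order until the next one is not expected to fit.

    Returns (total_hours, models_completed). A model that would run past
    cutoff_factor * target_hours is abandoned at that limit, as the thread
    describes for very long first models.
    """
    elapsed = 0.0
    completed = []
    limit = cutoff_factor * target_hours
    for duration in model_hours:
        if not should_start_another_model(elapsed, completed, target_hours):
            break
        if elapsed + duration > limit:
            # Hard limit reached before this model could finish.
            return limit, len(completed)
        elapsed += duration
        completed.append(duration)
    return elapsed, len(completed)


if __name__ == "__main__":
    # 6 hour preference, first model takes 3.5 hours: the task ends after one model.
    print(simulate_task([3.5, 3.5], 6.0))
    # 6 hour preference, 2.5 hour models: two models fit, a third would overrun.
    print(simulate_task([2.5, 2.5, 2.5], 6.0))
```

Under these assumptions a task with a 6 hour preference ends anywhere between roughly 3 and 24 hours, which matches the range given in the post. It would not explain a task that ends after 2 hours, so that case would still look like a premature exit.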
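
The way the "to completion" estimate drifts, as described in the thread, can also be sketched. The posts say only that downward adjustments of the duration correction factor (DCF) are small and that a long-running task pushes the estimate up dramatically; the exact formula is not given. The update rule below is therefore an assumption chosen to reproduce that behaviour, and the names `update_dcf`, `to_completion` and `down_weight` are hypothetical rather than part of BOINC.

```python
def update_dcf(dcf, project_estimate_hours, actual_hours, down_weight=0.1):
    """Return a new duration correction factor after one finished task.

    Assumed rule (not BOINC's documented formula): if the task ran longer
    than the current correction predicts, jump straight up to the observed
    ratio; otherwise move only a fraction (down_weight) of the way down.
    """
    ratio = actual_hours / project_estimate_hours
    if ratio > dcf:
        return ratio
    return dcf + down_weight * (ratio - dcf)


def to_completion(project_estimate_hours, dcf):
    """Estimated runtime shown for a task that has not started yet."""
    return project_estimate_hours * dcf


if __name__ == "__main__":
    # First task returned after 3 hours against a 6 hour estimate:
    # the estimate for queued tasks drops only slightly below 6 hours.
    print(to_completion(6.0, update_dcf(1.0, 6.0, 3.0)))
    # First task returned after 24 hours: queued tasks now show 24 hours.
    print(to_completion(6.0, update_dcf(1.0, 6.0, 24.0)))
```

The asymmetry is the point of the sketch: one very long task is enough to inflate every queued estimate, while it takes many ordinary tasks to bring the estimates back down.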
©2024 University of Washington
https://www.bakerlab.org