When is a wu stuck?

Author	Message
adrianxw Joined: 18 Sep 05 Posts: 662 Credit: 12,167,519 RAC: 0	Message 38492 - Posted: 27 Mar 2007, 18:38:45 UTC Last modified: 27 Mar 2007, 18:46:02 UTC
I have this wu running on one of my machines, and 2 others of the same type running on other machines which I have no access to at the moment. All 3 machines have a 3 hour preference set. On this machine, the wu is still at 1% complete after >4.5 hours. I am well aware that the program will run at least 1 model, and that as a result, sometimes a wu will run longer than 3 hours. The thing is, I have no indication at all that this wu is doing anything. There are no files in the project directory getting updated. I am quite happy to let it run if it is doing anything positive of course. How long is reasonable to let this run?
* EDIT * Sod's law! It finished a couple of minutes after I typed the above, after 4:36:00!
<core_client_version>5.8.15</core_client_version>
<![CDATA[
<stderr_txt>
# random seed: 3342067
# cpu_run_time_pref: 10800
======================================================
DONE :: 1 starting structures built 30 (nstruct) times
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
</stderr_txt>
]]>
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0	Message 38500 - Posted: 27 Mar 2007, 20:19:37 UTC
I have a similar one running now. It has been going for about 2 hours 43 min and is at step 34000.

Greg_BE Joined: 30 May 06 Posts: 5774 Credit: 6,139,760 RAC: 0	Message 38501 - Posted: 27 Mar 2007, 20:36:47 UTC Last modified: 27 Mar 2007, 20:38:17 UTC
i've got that one queued in my system, but i will move it up ahead of 3 other wu's to see what it does. it should start running some time after 10am CET tomorrow.

B-Roy Joined: 26 Sep 05 Posts: 26 Credit: 46,951 RAC: 0	Message 38520 - Posted: 28 Mar 2007, 12:48:32 UTC
i just have one at 4:11 with Model: 1 and Step: 326500 (still showing 1%). I think that for slow crunchers like me, this is a potential problem considering that the wu does not checkpoint; due to this I lost 2h of crunching yesterday, when I turned off my PC with the same wu.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 38524 - Posted: 28 Mar 2007, 13:05:15 UTC Last modified: 28 Mar 2007, 13:08:48 UTC
B-Roy, yes, if your machine is on the slow side, it can take considerable time to reach the completion of that first model, at which time the task will have reportable results. Rosetta will then evaluate your runtime preference, and decide that is all your machine should crunch on that task. It will then skip to 100% completed and report in.
I don't believe it is accurate to say that there are no checkpoints. However, your point about how it is possible to lose 2hrs of crunching is clear, and illustrates the need for more checkpoints. Rhiju posted just this weekend that they are evaluating how to best address this concern. There are other cases where improved checkpointing will help preserve completed work and increase the project TFLOPs. One such case is when someone runs several projects, as you are probably doing as well.
Rosetta Moderator: Mod.Sense
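The behaviour described above (the run time preference is only consulted once a model has finished, so the first model always runs to completion) can be sketched roughly as follows. This is a minimal illustration, not Rosetta's actual code: the function name `run_task` and the list of per-model durations are hypothetical, and it assumes the check happens exactly once after each model.

```python
def run_task(model_durations, cpu_run_time_pref):
    """Simulate one task and return (models_completed, cpu_seconds_used).

    model_durations: hypothetical CPU seconds needed for each model.
    cpu_run_time_pref: preferred run time in seconds (10800 = 3 hours).
    """
    elapsed = 0.0
    completed = 0
    for duration in model_durations:
        elapsed += duration
        completed += 1
        # Assumption: the preference is only checked between models, so a
        # slow first model can overshoot the preference before reporting.
        if elapsed >= cpu_run_time_pref:
            break
    return completed, elapsed


# A single slow model taking 4:36:00 (16560 s) against a 3 hour preference
# still completes, which matches the "1 decoys from 1 attempts" log above.
print(run_task([16560], 10800))
```

With faster models, say 3000 s each, the same loop would stop after the fourth model, the first point at which the elapsed time passes the preference.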

B-Roy Joined: 26 Sep 05 Posts: 26 Credit: 46,951 RAC: 0	Message 38526 - Posted: 28 Mar 2007, 14:34:27 UTC
thanks for the quick reply. Is there actually a fixed number of steps for each model? I am at 446000 and counting, so I wonder whether I could estimate a potential end before having to shut down the computer for the night again.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 38530 - Posted: 28 Mar 2007, 16:22:23 UTC
I cannot tell you an exact number of steps. It varies. 400,000 is in line with normal for many tasks. In fact, it generally is very near the end. The main point is not to let the 1% indication throw you. It's not completed the first model yet, so it doesn't have the % complete calculated.
Also, just be aware that if you didn't reach a checkpoint, and power off your machine for the day and have the same situation again, where it doesn't reach a checkpoint before you must power off... if this task does that 5 times, then Rosetta will end it and get another. Most tasks take less time than that to complete each model, and so the next task will then run better with how you are using your machine. So, it's all built in to detect such a situation and to resolve it for you.
Rosetta Moderator: Mod.Sense
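The restart safeguard described above can be pictured with a small simulation. This is only a sketch under stated assumptions: the name `simulate_sessions`, the return strings, and the exact point at which the counter trips are hypothetical, and only the limit of 5 comes from the post.

```python
MAX_RESTARTS_WITHOUT_CHECKPOINT = 5  # figure quoted in the post above


def simulate_sessions(session_seconds, seconds_to_checkpoint,
                      max_restarts=MAX_RESTARTS_WITHOUT_CHECKPOINT):
    """Return the fate of a task that restarts from scratch each session.

    session_seconds: how long the machine stays on in each session.
    seconds_to_checkpoint: hypothetical CPU time needed to reach the
        first checkpoint (for example, the end of the first model).
    """
    restarts_without_checkpoint = 0
    for session in session_seconds:
        if session >= seconds_to_checkpoint:
            return "checkpointed"
        # Powered off before any checkpoint: all work in this session is lost.
        restarts_without_checkpoint += 1
        if restarts_without_checkpoint >= max_restarts:
            return "abandoned"
    return "still waiting"


# Five 2 hour sessions never reach a model that needs 4:36:00 of CPU time.
print(simulate_sessions([7200] * 5, 16560))
```

A single session longer than the time to the first checkpoint is enough to break the cycle, which is why more frequent checkpoints would help machines that are switched off daily.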

Greg_BE Joined: 30 May 06 Posts: 5774 Credit: 6,139,760 RAC: 0	Message 38532 - Posted: 28 Mar 2007, 18:06:02 UTC - in response to Message 38492. Last modified: 28 Mar 2007, 18:07:38 UTC
see here for a similarly named work unit. It completed on my computer with 3 decoys and 30 nstruct. I had mine run 8 hours, which i do for all WU's.

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0	Message 38537 - Posted: 28 Mar 2007, 20:44:27 UTC - in response to Message 38532. Last modified: 28 Mar 2007, 20:45:37 UTC
Hi everyone: Over on ralph, we are testing a new app (5.55) that updates the "percentage complete" in a more reasonable way. Basically, we are following suggestions posted by the users and forum moderators -- we are incrementing the percentage complete every second by an amount scaled so that 100% would correspond to 4 times the user's preferred CPU run time (this is the max time allowed for any workunit). Once each decoy is completed, the % complete is updated (usually jumps up) to a more accurate value! Hopefully this will help prevent some of the confusion for new users!
[We are also working on more frequent checkpointing, but this turns out to be more challenging -- expect progress over the next two weeks.]
Moderators, can you spread the news to the other threads where this question is being discussed?
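As a rough illustration of the time-based progress scheme described for the 5.55 test app, the displayed percentage before the first decoy finishes could be computed along these lines. This is a sketch, not the app's code: the function name `time_based_percent` is hypothetical, and the correction applied when a decoy completes is left out because the post does not say how that value is calculated.

```python
def time_based_percent(elapsed_seconds, cpu_run_time_pref, max_factor=4):
    """Percent complete from elapsed CPU time alone.

    100% corresponds to max_factor times the preferred run time, the
    maximum time allowed for any workunit according to the post.
    """
    max_seconds = max_factor * cpu_run_time_pref
    return min(100.0, 100.0 * elapsed_seconds / max_seconds)


# With a 3 hour preference (10800 s), the task that sat at 1% for 4:36:00
# (16560 s) would instead have shown about 38.3% just before finishing.
print(round(time_based_percent(16560, 10800), 1))
```

Because the ramp is capped at 4 times the preference, a task that is still computing its first model keeps showing visible progress instead of sitting at 1%.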