Problems with some CAPRI15 WU's

Author	Message
Keith T. Joined: 1 Mar 07 Posts: 58 Credit: 34,135 RAC: 0	Message 52554 - Posted: 17 Apr 2008, 11:19:25 UTC
Task FRA_t038_CAPRI15_1n82_1_IGNORE_THE_RESTt038_1_input.pdb_3060_216104
I have my target runtime set to 1 hour (Athlon XP 2200+). I know that Rosetta will always complete at least 1 "decoy". My current task started over 3.5 hours ago by wall clock time, or 3hr 27min according to BOINC Manager. Steps are still incrementing on the "Show Graphics" screen, currently at 56804.
I looked in the BOINC "slot" folder, and the last checkpoint file was written over 2.5 hours ago, 56 minutes after the task started.
What would happen if I closed down BOINC, or a power failure happened?

Keith T.	Message 52555 - Posted: 17 Apr 2008, 12:02:00 UTC
Update on this task. CPU usage just dropped to 0%. I looked in the slot folder, and rosetta_random and rosetta_decoy_cnt were updated about 4 minutes ago. The "Show Graphics" button now does nothing, but the button is not "greyed out". BOINC Manager has stopped at 3:55:20, with progress stopped at 95.924%.
Going to shut down BOINC as no other tasks are running.

Keith T.	Message 52556 - Posted: 17 Apr 2008, 12:24:04 UTC
When I shut down BOINC after 15 minutes of no activity, the following files were written or updated: stdout.txt, stderr.txt, farlxcheck.
I then restarted BOINC and a SETI task restarted. Rosetta is now on an STD (short-term debt) of -11428 seconds, so I will have to wait a few hours for this task to restart unless I suspend all my other projects. I have also made a copy of the slot folder in case this task errors out when it restarts.
Keith

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 52563 - Posted: 17 Apr 2008, 14:25:28 UTC
I've noticed some of the work recently is taking as long as 5 or 6 hours to complete a single model. A given task will run for one complete model, then poke his head out, look around and see what time it is. If he's got enough time to do another model, based on the time it took for the first, and finish near your target runtime, then it begins a second model. Otherwise it marks it as completed and sends in the results.
The watchdog will presume something is seriously wrong if a task runs for 4 times longer than your runtime preference. Unfortunately, if you have a 1 hour runtime preference, that might cause the watchdog to terminate work that is actually proceeding well, and if your runtime is 24hrs, it might take the watchdog days longer than needed to determine a task isn't progressing normally.
The watchdog also watches the current score of your model. It should fluctuate over time. If it doesn't move for a 15 minute period of time, the watchdog will end the task as well.
...depending on what type of task you are running there, it may or may not have taken a checkpoint that preserved the work it had done so far. So, you may see the task begin from the start again when you restart it.
1hr is a very low target runtime, and there are many types of Rosetta tasks that take considerably longer than that to complete that first model. Because your target is so low, your time to completion estimates will often be rather dramatically off. And you will see tasks run through to about 10 minutes remaining, and then seem to stall. They're still running fine, but if time keeps marching on at normal speed, you would go to a negative time remaining, and that's not right either, so they take that last 10 minutes and basically make it take forever. The time remaining is just a guess. The task will finish when it's done... whether or not 10 minutes remaining was showing at the time.
With such a low runtime, and the variability in the actual time your BOINC client observes work taking, it will have considerable trouble properly estimating how much work to get to meet your cache objectives. A longer runtime will make for better estimates and scheduling all around.
This is changed in your Rosetta Preferences. Just click the "[Participants]" link above here on the message board to review your settings.
Rosetta Moderator: Mod.Sense
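
The rules described in the post above can be sketched roughly as follows. This is only an illustration of the behaviour as described in this thread, not Rosetta's actual code: the function names and constants are hypothetical, and the exact test for "finish near your target runtime" is an assumption (here, one more model of average length must fit inside the target).

```python
# Hypothetical sketch of the behaviour described in the thread.
# Names and the exact "fits near target" rule are assumptions, not Rosetta source.

WATCHDOG_MULTIPLIER = 4            # "4 times longer than your runtime preference"
STUCK_SCORE_SECONDS = 15 * 60      # score unchanged "for a 15 minute period"


def watchdog_should_end(cpu_seconds, runtime_pref_seconds, seconds_since_score_changed):
    """Return True if either watchdog condition from the post is met."""
    ran_too_long = cpu_seconds > WATCHDOG_MULTIPLIER * runtime_pref_seconds
    score_stuck = seconds_since_score_changed >= STUCK_SCORE_SECONDS
    return ran_too_long or score_stuck


def should_start_another_model(cpu_seconds, models_done, runtime_pref_seconds):
    """After a model completes, decide whether to begin another one."""
    if models_done == 0:
        return True  # the first model is always run to completion
    average_model_seconds = cpu_seconds / models_done
    # Assumed rule: start another model only if it should finish by the target.
    return cpu_seconds + average_model_seconds <= runtime_pref_seconds
```

Under these assumptions, a task with a 3600 second preference is ended by the run-length check only once its CPU time passes 14400 seconds, so a task that has run for 3.5 hours with a changing score would still be left alone.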

Keith T.	Message 52568 - Posted: 17 Apr 2008, 17:35:56 UTC Last modified: 17 Apr 2008, 17:38:05 UTC
Thanks for the reply. I was aware that 1 hour can be too short for some models/decoys. I reduced it to 1 hour a few weeks ago when RALPH started getting a lot more work than usual and I also added a few other projects.
Rosetta currently has 8% resource share as I run a lot of projects on 1 PC. Some of these, like LHC, SIMAP, and RALPH, often have no work, so the true resource share is usually closer to 10 or 15%. BOINC always manages to return tasks in time, but I dropped the target time for Rosetta when things were getting tight a few weeks ago.
I read about the Watchdog time elsewhere while I was waiting for a reply. Rosetta is still at -9200 STD, so I will probably give it an artificial boost in the next few hours as I want to see what this task will do today.

Keith T.	Message 52570 - Posted: 17 Apr 2008, 18:45:42 UTC
The task ran for another 5 minutes when it restarted, then errored out. The reported time is 314 seconds, but it really ran 3hr 55min longer than that.
I have kept a copy of the slot folder from when it stalled earlier. If the developers would like a Zip of this, I can send it.
I don't suppose there is any chance of getting a credit adjustment for this task?

Keith T.	Message 52587 - Posted: 18 Apr 2008, 11:43:51 UTC Last modified: 18 Apr 2008, 11:50:07 UTC
Going to change the title of this to make it more meaningful, from "How far past target runtime will it go?" to "Problems with CAPRI15 WU's".
I have another CAPRI15 WU, FRA_t038_CAPRI15_1n82_1_IGNORE_THE_RESTt038_1_input.pdb_3060_235298, which has also gone well past 3 hours without checkpointing. It is currently preempted by other tasks that have gone into "high-priority", but it is still in memory.

Mod.Sense Volunteer moderator	Message 52588 - Posted: 18 Apr 2008, 12:20:02 UTC
I've tried to explain that, from what you describe, the activity you are seeing sounds normal. And I always explain that some types of tasks do not checkpoint as frequently as others. In fact, some only checkpoint when a model is completed... which can be anywhere from 5 minutes to 5 hours. Just depends what type of task it is.
Given your change in title, I wanted to clarify your "problems with...". Is the only problem you are observing that they take a long time to checkpoint, and to complete their first model (which in your case will take longer than your runtime preference, and cause them to report in)? Or are there other things you are seeing unique to the CAPRIs?
Rosetta Moderator: Mod.Sense

Keith T.	Message 52591 - Posted: 18 Apr 2008, 12:38:16 UTC - in response to Message 52588.
> I've tried to explain that, from what you describe, the activity you are seeing sounds normal. And I always explain that some types of tasks do not checkpoint as frequently as others. In fact, some only checkpoint when a model is completed... which can be anywhere from 5 minutes to 5 hours. Just depends what type of task it is.
> Given your change in title, I wanted to clarify your "problems with...". Is the only problem you are observing that they take a long time to checkpoint, and to complete their first model (which in your case will take longer than your runtime preference, and cause them to report in)? Or are there other things you are seeing unique to the CAPRIs?
OK, I know I have my run time prefs set on the low side, but it is a valid setting.
My PC has 768MB RAM, Windows XP, and an AMD Athlon XP 2200+ CPU. I usually run BOINC 24/7 with 75% memory usage while "in use" and 100% while idle.
Most other Rosetta WU's have created at least 1 decoy/model within 1.5 hours. I just checked back in my BoincView logs and I have had some success with previous CAPRI15 WU's. It just seems to be these last 2 that have gone on for much longer than normal without producing 1 decoy/model. I also took a brief look at the graphics window, and the molecule looks much more complex than usual.
I will let the current model run as long as the CPU does not drop to 0% again. It is the last CAPRI15 that I currently have.

Keith T.	Message 52592 - Posted: 18 Apr 2008, 13:46:46 UTC
FRA_t038_CAPRI15_1n82_1_IGNORE_THE_RESTt038_1_input.pdb_3060_235298_0 finally finished with success.
stderr out
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 3600
# random seed: 2869703
********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 17240.7 seconds. Greater than 4X preferred time: 3600 seconds
********************************************************************
GZIP SILENT FILE: .aat038.out
</stderr_txt>
]]>
Validate state Valid
Claimed credit 46.1297831235163
Granted credit 80

dcdc Joined: 3 Nov 05 Posts: 1834 Credit: 124,182,400 RAC: 6,174	Message 52593 - Posted: 18 Apr 2008, 13:58:23 UTC - in response to Message 52591.
> I also took a brief look at the graphics window, and the molecule looks much more complex than usual.
CAPRI is protein-protein docking, so the tasks are more complex than usual AFAIK.
I think that either the minimum run time or the watchdog multiplier before cancellation needs to be reviewed on the project side. I assume the multiplier could be altered on a task-by-task basis (as it's a Rosetta function rather than a BOINC one)? Could all CAPRI tasks be changed to 6x?

Mod.Sense Volunteer moderator	Message 52594 - Posted: 18 Apr 2008, 15:04:30 UTC
I had forgotten that it had dropped to 0% CPU at one time. From your description it sounds like that event would have been after 4 hours of runtime, and so would indicate the watchdog attempted, unsuccessfully, to shut down the task. So, I'll state that your... problem with CAPRI WUs is that the watchdog is not always able to properly end them when the runtime limits are exceeded on Win XP.
...from there you restarted from the beginning, because no checkpoint was reached, ran another 4 hrs, and now apparently this time the watchdog was able to end it properly.
dcdc, 6x runtimes is not the answer either. I mean that would work for the 1hr runtime preference, but if you have a 24hr preference, it would take many days for the watchdog to realize there's a problem, when one occurs.
Ideally the processing on the project server would be sophisticated enough to not assign long running work like this to a host with a short runtime preference. But both the watchdog and the runtime preference are unique to Rosetta, rather than being part of BOINC. So the BOINC routines that assign the work are not that sophisticated.
I realize that is a more complete statement of the problem you are reporting. But it's less likely to happen any time soon. The watchdog not being able to shut down a task when it wants to is not how things should work. The rest are ways to make things work better (i.e. enhancement vs bug fix).
Rosetta Moderator: Mod.Sense
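
The trade-off between a 4x and a 6x multiplier can be put into numbers. The sketch below only restates the arithmetic from the posts above; the function name is hypothetical, and it assumes the watchdog limit is simply the runtime preference times a fixed multiplier.

```python
# Arithmetic sketch only; assumes limit = runtime preference x multiplier.

def watchdog_limit_hours(runtime_pref_hours, multiplier=4):
    """Hours of CPU time before the watchdog would end a task."""
    return runtime_pref_hours * multiplier


for pref_hours in (1, 24):
    for multiplier in (4, 6):
        limit = watchdog_limit_hours(pref_hours, multiplier)
        print(f"{pref_hours}hr preference, {multiplier}x: limit {limit} hours ({limit / 24:.2f} days)")
```

With a 1hr preference, 6x would leave room for a model that needs 5 to 6 hours. With a 24hr preference, the same multiplier means a stuck task could run for 144 hours (6 days) before the run-length check ends it, compared with 96 hours (4 days) at 4x.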

Keith T.	Message 52596 - Posted: 18 Apr 2008, 16:18:35 UTC Last modified: 18 Apr 2008, 16:20:04 UTC
Thanks for the reply.
I had 2 recent CAPRI15 WU's:
FRA_t038_CAPRI15_1n82_1_IGNORE_THE_RESTt038_1_input.pdb_3060_216104_0
FRA_t038_CAPRI15_1n82_1_IGNORE_THE_RESTt038_1_input.pdb_3060_235298_0
First one:
CPU time 314.0625
stderr out
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 3600
# random seed: 2888897
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C910F29 read attempt to address 0x00000000
Engaging BOINC Windows Runtime Debugger...
# cpu_run_time_pref: 3600
ERROR:: Exit from: .initialize.cc line: 1614
</stderr_txt>
]]>
Validate state Invalid
Claimed credit 0.840314964353539
Granted credit 0
Second one:
CPU time 17240.72
stderr out
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 3600
# random seed: 2869703
********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 17240.7 seconds. Greater than 4X preferred time: 3600 seconds
********************************************************************
GZIP SILENT FILE: .aat038.out
</stderr_txt>
]]>
Validate state Valid
Claimed credit 46.1297831235163
Granted credit 80
The first one really ran for > 4 hours, but the reported time is 314 seconds.
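
For anyone comparing several results like these, the watchdog line in stderr can be pulled apart with a short script. This is a sketch that assumes the line always has the wording shown in the output above; the helper names are made up for illustration, and the 3:55:20 figure is the BOINC Manager reading reported earlier in this thread.

```python
import re

# Pattern matches the watchdog line exactly as it appears in the stderr above.
WATCHDOG_LINE = re.compile(
    r"CPU time:\s*([\d.]+) seconds\. Greater than (\d+)X preferred time:\s*(\d+) seconds"
)


def parse_watchdog_line(stderr_text):
    """Return (cpu_seconds, multiplier, pref_seconds), or None if no watchdog line."""
    match = WATCHDOG_LINE.search(stderr_text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2)), int(match.group(3))


def hms_to_seconds(text):
    """Convert an h:mm:ss reading such as '3:55:20' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


sample = "CPU time: 17240.7 seconds. Greater than 4X preferred time: 3600 seconds"
cpu_seconds, multiplier, pref_seconds = parse_watchdog_line(sample)
print(cpu_seconds > multiplier * pref_seconds)   # second task: past the 14400 s limit
print(hms_to_seconds("3:55:20"))                 # first task: CPU time lost before the restart
```

The second task's 17240.7 seconds is above the 4 x 3600 = 14400 second limit, which matches the watchdog message. For the first task, the 3:55:20 shown before the restart is 14120 seconds that do not appear in the 314 second reported CPU time.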