Message boards : Number crunching : Report long-running models here
Author | Message |
---|---|
Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
As you can see, my chosen runtime is 12 hours, but this task ran for 20 hours. Stealing the sentiment of greg_be: "This sucks". 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_97328_0 Claimed credit 153 Granted credit 76 (benevolent of you) # cpu_run_time_pref: 43200 ====================================================== DONE :: 1 starting structures 71773.1 cpu seconds This process generated 2 decoys from 2 attempts ====================================================== |
Rifleman Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=218071526 This WU is 5 hours and 18 minutes into a 12-hour runtime and is still on the first decoy. Running BOINC 6.4.5 and Rosetta Mini 1.47 on Vista with an AMD Phenom 9500 quad-core and 3 GB of RAM. I have been getting lots of WUs that produce maybe 1 to 3 decoys in 12 hours of CPU time and earn very little credit for the invested time. The last 3 or 4 days have been so bad credit-wise that my RAC is dropping. |
Wissi Joined: 19 Nov 08 Posts: 14 Credit: 485,807 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=217748451 Workunit 198419747. Its name is 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_156441_0, for Rosetta Mini 1.47. It has now been sitting at about 96% for a very long time; the last estimated 15 minutes usually take at least another 2 hours (or more). All my recent results showed that the watchdog ended the runs because the time used was more than 3 times the preferred time. BTW: this unit is still running in model 1. The second task, running right now with the same symptoms, is https://boinc.bakerlab.org/rosetta/result.php?resultid=218092186, Workunit 198571904. Its name is 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_110537_1 (this unit already has one client error as a result). So the estimates do not seem to be good, especially since I have the impression that the last 10% of the estimated time takes 90% of the real time. I don't think that this is a problem with my computer, since the benchmark results are known to the schedulers... |
Wissi Joined: 19 Nov 08 Posts: 14 Credit: 485,807 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=217748451 Interesting effect: after shutting down and restarting my computer (due to a software installation), the CPU time used by the two processes went down from about 4:30 hours to about 2:45, and the completion percentage went down from about 96% to about 94%. I forgot to mention that I use BOINC Manager 6.4.5. Has anyone else seen this effect? |
Rifleman Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=218071526 The same task mentioned in my last post is now 28 hours in, still on the 1st decoy, and stuck at 99 percent with 10 minutes to go. I now have the choice of aborting for 0 credit, or letting the watchdog abort it in another 10 hours for very little credit. |
Rifleman Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=218071526
Task ID: 218071526
Name: 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_113811_1
Workunit: 198586483
Created: 31 Dec 2008 15:42:51 UTC
Sent: 31 Dec 2008 16:03:51 UTC
Received: 1 Jan 2009 21:18:15 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x0)
Computer ID: 948562
Report deadline: 10 Jan 2009 16:03:51 UTC
CPU time: 101439.9
stderr out:
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 43200
======================================================
DONE :: 1 starting structures 101440 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
</stderr_txt>
]]>
Validate state: Valid
Claimed credit: 392.681268578232
Granted credit: 37.8765655196635
Application version: 1.47
Thanks for the credit. |
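For anyone who wants to compare several result pages, the stderr fields quoted in the post above can be pulled out with a few regular expressions. This is only a rough sketch: the function name `summarize_stderr` and the returned keys are made up for illustration, and it assumes the stderr text always carries the three lines seen in this thread (`cpu_run_time_pref`, `... cpu seconds`, and `generated N decoys`).

```python
import re


def summarize_stderr(stderr_text):
    """Extract runtime preference, CPU seconds, and decoy count from task stderr.

    Hypothetical helper; the patterns are based only on the stderr samples
    posted in this thread.
    """
    pref = re.search(r"cpu_run_time_pref:\s*(\d+)", stderr_text)
    cpu = re.search(r"([\d.]+)\s+cpu seconds", stderr_text)
    decoys = re.search(r"generated\s+(\d+)\s+decoys", stderr_text)
    if not (pref and cpu and decoys):
        raise ValueError("stderr text is missing one of the expected fields")

    pref_s = int(pref.group(1))
    cpu_s = float(cpu.group(1))
    n = int(decoys.group(1))
    return {
        "pref_hours": pref_s / 3600,
        "cpu_hours": cpu_s / 3600,
        # How many times longer the task ran than the chosen runtime.
        "overrun_ratio": cpu_s / pref_s,
        "decoys": n,
        "cpu_hours_per_decoy": (cpu_s / 3600 / n) if n else None,
    }


# Values copied from the stderr output posted above.
sample = (
    "# cpu_run_time_pref: 43200 "
    "DONE :: 1 starting structures 101440 cpu seconds "
    "This process generated 1 decoys from 1 attempts"
)
summary = summarize_stderr(sample)
print(summary)
```

For the task above, this reports a 12-hour preference, a bit over 28 CPU hours, and a single decoy, which is the kind of summary the thread asks people to report.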
LizzieBarry Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
Workunit 198419747 This is quite normal. On restarting, the task went back to its last checkpoint, which appears to have been at 2h 45m, and began again from there. That's about 94% of the default 3 hour run time. Once the time to completion gets to 10 minutes it stops reducing, and the task then just shows (runtime/(runtime+10mins)) as the percentage of work done until completion. Claimed credit 392.681268578232 I just saw that. Ouch! :( |
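The formula in the post above can be checked against the numbers reported earlier in the thread. A minimal sketch, assuming the displayed percentage really is runtime/(runtime + 10 min) once the remaining-time estimate stops shrinking; the function name `stuck_progress_percent` and the `floor_seconds` parameter are hypothetical, not taken from BOINC or Rosetta.

```python
def stuck_progress_percent(cpu_seconds, floor_seconds=600):
    """Percent done shown while the time-to-completion estimate sits at its floor.

    Assumes the runtime / (runtime + 10 minutes) rule described in this thread.
    """
    return 100.0 * cpu_seconds / (cpu_seconds + floor_seconds)


# CPU times mentioned in the thread: the two values around the restart,
# a 12-hour preference, and the 28-hour task.
for label, minutes in [("2h 45m", 165), ("4h 30m", 270), ("12h", 720), ("28h", 1680)]:
    print(f"{label:>7}: {stuck_progress_percent(minutes * 60):.1f}%")
```

Under this assumption, 2h 45m and 4h 30m come out near the 94% and 96% reported around the restart, and a 28-hour task sits just above 99%. The percentage only creeps upward as CPU time accumulates, so it says little about how close the model actually is to finishing.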
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
cc2_1_8_mammoth_mix_fa_cst_hb_t305__IGNORE_THE_REST_1LARA_4_6175_89_0 took 21 hours to produce one decoy. cc2_1_8_mammoth_mix_cen_cst_hb_t305__IGNORE_THE_REST_1YGRA_5_5874_85_0 took over 28 hours to produce one decoy. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=218058455 cc2_1_8_mammoth_mix_cen_cst_hb_t342__IGNORE_THE_REST_2G0QA_2_5889_234_0 CPU time 15509.59 cpu_run_time_pref: 14400 credit was great even with the over run |
Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
Claimed credit 392.681268578232 People are experiencing similar results. So how can the credit system be working right if there is such a large disparity between claim and grant? I thought the grant was a running average of claims? Apparently too few decoys are being produced. Something smells; and if the smell gets strong enough I will, er... "have to leave the room". |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Claimed credit 392.681268578232 can you post what task that was? with large over runs the credit system gets messed up from what i can tell. i've had stuff that goes 2 hours over limit and returns a lousy credit. |
Wissi Joined: 19 Nov 08 Posts: 14 Credit: 485,807 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=217748451 Okay, here again is what I saw: yesterday, when I shut down my computer, workunit 198419747 had a processor time of 6 hrs 17 min 40 seconds. BOINC told me that 97.418% of the work was done, with an estimated 9 mins 40 seconds to the end of the job. Workunit 198571904 had 6 hrs 11 min 17 seconds of processor time, and 97.376% of the work was done, with an estimated 9 mins 30 seconds until the end of the job. A few minutes ago I restarted my computer, so the BOINC manager restarted the computation. Now the values are as follows: Workunit 198419747: 4 hrs 8 mins 36 seconds of processor time, 96.131% work done, and an estimate of 15 minutes 1 second until the end. Workunit 198571904: 3 hrs 52 mins 33 seconds of processor time, 95.877% work done, and an estimate of 15 minutes 33 seconds. If this continues in THIS way (this is now the 2nd time that several hours of processor time have been "stolen" by a restart), I will also have to leave the room and set Rosetta to "don't get any more tasks". The credits are only a symbolic value, but as someone else has already written here: something smells, and not very good. I will wait until the watchdog stops both jobs, because I don't want to lose the symbolic credits (which I will if I stop the jobs myself). There is a bigger problem somewhere in the system, and it has to be investigated. I think there is enough evidence when I read all the posts in this thread. There is another task waiting for execution: https://boinc.bakerlab.org/rosetta/result.php?resultid=218251155, workunit 198873439. It has an expected run time of 14 hours, 27 minutes and 27 seconds. I thought the preferred runtime would be something between 3 and 6 hours. So maybe the benchmark results of this computer are being totally misinterpreted by the scheduler that sends me those tasks. By the way: happy new year. |
jay Joined: 12 Jan 08 Posts: 20 Credit: 195,801 RAC: 0 |
Hi, Another long running task...
Full WU name (you can copy the BOINC message from when the task completes): wuid=197237636 cc_nonideal_1_3_nocst4_hb_t303__IGNORE_THE_REST_2AH5A_6_5991_15
Type of operating system (version of Windows, Linux distribution, or Mac info.): Windows XP SP3 & updates
BOINC version (see BOINC Manager "About" page): 6.4.5, wxWidgets version 2.8.7
Rosetta version (see BOINC Manager "tasks" page): Rosetta mini 1.47
A link to the task's results page: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=197237636
If a specific model took longer than the rest of them, then what model # was shown in the graphic? model: 2 step: 1879201,...1880095,...,1881207,...,1882051,
The graphic's RMSD to Accepted energy plot is very scattered, with no dense concentration. A blue strand and a green strand are waving at each other. They extend from a clump of the rest of the material.
My notes: The task gets stuck at 97%. This is running on a laptop. I had to power off last night to drive home. But before I suspended all tasks, I noticed that this task had run over 3 hours and was stuck at 97%. I also noted over 3,000,000 page faults on the Rosetta task. I have 2 GB of memory; Task Manager says 1.2 GB of physical memory is available. When I got home last night, I restarted BOINC, running Rosetta, WCG, and Spinhenge. I noticed that the task lost its work and started over. This morning, after 9 hours of real time, the task was stuck again. I just (around 9:00AM EST) suspended all other tasks and set 'no new tasks', so there will be no swapping out of the Rosetta task and possible loss after a checkpoint restart. After 1/2 hour of a single BOINC task running on an Intel duo, it still shows: Progress 97.520 % To complete: 15.04 minutes CPU time: 06:34:44 The 'to complete' time changes every 6 to 10 seconds and either increases or decreases by a second. Oops, it just stayed for over a minute on 15.01 seconds. I'll try to get a 10 minute CPU time interval...
Progress 97.592 % To complete: 14.55 minutes CPU time: 06:45:21 I'll post this and see what is suggested. Thanks in advance!! Jay PS More data: Wall Clock: 10:00AM Progress 97.750% To complete: 00:14:37 minutes CPU time: 07:14:17 |
jay Joined: 12 Jan 08 Posts: 20 Credit: 195,801 RAC: 0 |
Hi, The task did complete at wall clock time 11:04AM Jay |
Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
Claimed credit 392.681268578232 See Rifleman's posts 58332 and 58333 in this thread. I also posted one earlier. |
quadro Joined: 22 Oct 08 Posts: 3 Credit: 10,085,084 RAC: 0 |
|
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
...for those concerned about credit and fairness and what smells and what doesn't... please read the original post of this thread. Models that take significantly longer than average will receive significantly less credit than claimed. That's how an average works. The large claim goes into the average, but only for one model, as compared to 1000s of others. So it ups the average, but by how much depends on how commonly models run long. That is one of many good reasons to work to eliminate these long-running models. Another is that if long-running models can be eliminated, then the estimated runtimes will be more reliable and work fetch more predictable. The new approaches being used to study the proteins seem to have a higher variability between models than we are all used to. The team has reviewed the information in this thread and is working on some approaches to addressing the long-running models, and to studying them further. Rosetta Moderator: Mod.Sense |
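The averaging argument in the post above can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the project's actual credit code: it assumes granted credit per model is the plain mean of all per-model claims reported so far, and the 1000 "typical" claims of 38 credits each are invented for illustration. Only the 392.68 claim comes from this thread.

```python
def granted_credit(claims, models_in_task):
    """Grant the average claimed credit per model, times the models in this task.

    claims: list of (claimed_credit, models_completed) pairs reported so far.
    Hypothetical model of the running average described in the post above.
    """
    total_credit = sum(credit for credit, _ in claims)
    total_models = sum(models for _, models in claims)
    return models_in_task * total_credit / total_models


# Hypothetical history: 1000 ordinary models, each claiming 38 credits.
typical = [(38.0, 1)] * 1000
# One long-running model with the claim quoted in this thread.
long_running = (392.68, 1)

claims = typical + [long_running]
grant = granted_credit(claims, models_in_task=1)
print(f"average per model: {grant:.2f} credits")
print(f"claimed by the long-running task: {long_running[0]:.2f} credits")
```

With these made-up numbers, the single large claim moves the per-model average by well under one credit, so the long-running task is granted roughly a tenth of what it claimed. The more often models run long, the more the average rises, which matches the moderator's point.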
Rifleman Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0 |
...for those concerned about credit and fairness and what smells and what doesn't... please read the original post of this thread. Models that take significantly longer than average will receive significantly less credit than claimed. That's how an average works. The large claim goes into the average, but only for one model, as compared to 1000s of others. So it ups the average, but by how much depends on how commonly models run long. If a fast CPU runs flat out for 28 hours and generates one decoy, there must have been a hell of a lot of work done to figure the decoy out? I have had over a week of these difficult units, and credit for them is abysmal compared to what earlier WUs were awarding. It's almost like folks with long runtime preferences are being penalized for it. I am new to the project and to distributed computing in general, but given that this increases my hydro bill by significant amounts, there should be closer attention to the way these credits are awarded. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
mod, how do you explain the weirdness of this user's tasks that were running at 6 hrs and still had a long way to go to completion? then upon reboot of the computer nearly the same amount of work is shown completed for less time used. kind of some odd stuff going on with his tasks. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
mod, It all relates to the time to completion estimate. As was stated earlier, when the machine rebooted, the task reverted to its last checkpoint. And at that point, the % complete is going to be based on the runtime preference. In short, the task should be about to proceed down the same path: running for too long, showing about 10 minutes to go the whole time. There's no exceptional weirdness described there. It is simply how the symptoms appear when you have a long-running model. Rosetta Moderator: Mod.Sense |
©2024 University of Washington
https://www.bakerlab.org