Message boards : Number crunching : Report long-running models here
Author | Message |
---|---|
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
"3) You can upgrade BOINC any time, even with work in progress. The Rosetta application is still the same, and this is what is truly processing the work, so the BOINC upgrade should not pose a problem." Not necessarily wrong, just not necessarily right ... :) I am one of the "lucky" ones: I have upgraded and downgraded BOINC versions with abandon and I don't think I have ever lost work. However, different BOINC versions calculate things differently, and that can cause issues when a new version is used. For example, the later 6.6.x versions use a very different LTD (long-term debt) model than the old versions. So there can be instances where changing versions causes issues, such as more work being downloaded. I have three very nearly identical systems, all connected to WCG, where I was trying to get enough work from the new sub-project to earn my "gold", and one of them has 268 tasks ... why it has so many more than the other two, I have nary a clue ... but it is gamely trying to work through all those tasks before their deadline ... I am still scratching my head over why one downloaded so much work and the other two only got reasonable amounts ... oops, down to only 244 tasks that are likely to miss deadline ... :) |
Cesium_133* Joined: 1 Dec 08 Posts: 28 Credit: 225,332 RAC: 0 |
A long-running 1.67 workunit: Here's one more to add to the corpus of aborted WUs: threading_lb_test1_hb_t373__IGNORE_THE_REST_11850_3473_1 Aborted 3 June 09 23:44:52 EDT (I guess 03:44:52 4 June 09 UTC). It was 5% done after 5 hours, the original estimated run time was about 3h 50m, the time to completion was increasing 1:1 with the time spent on it, the WU was not performing, and no graphics were visible despite the mini-view's assertion to the contrary. The other Rosetta WU my PC was crunching was running fine, and BOINC had defaulted to an AI WU, which was completed; it then apparently went back to the one I aborted, with no success. Next time I might try closing and re-opening BOINC, as I only saw that suggestion after I aborted the task. I do hope someone keeps a record of all WUs, aborted or otherwise; perhaps there are one or more common threads to them... The lovely lady you see isn't I, but Hayley Westenra, a classical crossover singer from Christchurch, NZ. There is no known voice as hers. Check her out- she's seraphic. |
TestPilot Joined: 23 Sep 05 Posts: 30 Credit: 419,033 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=233872277 Rosetta 1.71 Aborted after 27 hours of crunching and counting... And it was assigned to another puter... TestPilot, AKA Administrator |
michaelmastro Joined: 11 Oct 05 Posts: 51 Credit: 1,530,918 RAC: 0 |
This unit has been running for 13 hours and is 28% complete, with 17 hours remaining: lb_alnmatrix_threading_alncap__hb_t308__IGNORE_THE_REST_12574_4927_0 using minirosetta version 171 Windows Vista BOINC 6.6.20 Rosetta Mini 1.71 https://boinc.bakerlab.org/rosetta/results.php?userid=3968 This unit is also running, but in only 2.5 hours it has completed over 80%, with 0.5 hours remaining: lr5_E_rama_map_iter05_rlbd_1ubi_SAVE_ALL_OUT_12503_440_0 using minirosetta version 171 |
michaelmastro Joined: 11 Oct 05 Posts: 51 Credit: 1,530,918 RAC: 0 |
BTW - This problem only occurs on my Vista machine. The Mac is having no problems... |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
michaelmastro, it looks like you are running a 6.6 BOINC Manager version. Please select the task in question and click Properties. What is shown for the CPU time used by the task? It looks like your prior tasks were running for the 3 hr default runtime. If this task has more than 7 hours of actual CPU time (4 hrs over your preference, where the watchdog should have ended it), then something isn't right and you should abort it. Rosetta Moderator: Mod.Sense |
William T.M. Theisen Joined: 11 Sep 06 Posts: 7 Credit: 527,145 RAC: 0 |
lb_dk_ksync_withtrim_hb_t297__IGNORE_THE_REST_12980_1893_0 got stuck at 6.888% and has been running for 29 hours so far, and its "time to completion" has gone up from 60 hours to 65 hours. I'm not sure what is going on with it. Should I abort it? |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Not happy about this at all: one of the few tasks I've been able to get, and this happens. Ten hours for one model on my 3 GHz, great credit too. Others had problems with it too. 1qlx_NNMAKE_CONSTRAINT_BOINC_ABRELAX_SAVE_ALL_OUT_14240_677_2 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=243170635 36,076.19__106.17__3.24 |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
And another. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=243854311 abinitio_withrelax_homfrag_129_B_1ubi__SAVE_ALL_OUT_13795_832_0 # cpu_run_time_pref: 14400 ====================================================== DONE :: 1 starting structures 23518.3 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== That's six & a half hours. |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
I have another over 10hrs for 1 model. mini 1.87. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=245101531 abinitio_withrelax_homfrag_129_B_1vcc__SAVE_ALL_OUT_13795_3017_0 |
AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
Long running task: 269551688 name: lr10_seq_score12_rlbd_1prq_IGNORE_THE_REST_DECOY_13841_3329_0 application version: 1.90 OS: Linux AdeB |
AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
Long running task: 272664497 name: lr8_newhb_run02_rlbn_2apb_IGNORE_THE_REST_NATIVE_NOCON_14611_463_1 application version: 1.91 OS: Linux CPU time: 57738.5s, 14400s + 43200s Granted credit: 4.01992761072857 AdeB |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. I aborted this when I looked at the graphics after 3 hrs 28 min: it was sitting on model 1 step 0 and not moving, so it's gone! lr5_combine_smooth_torsion_it00_A_rlbd_2hkv_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_14643_667 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=251816801 |
AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
Long running task: 278731357 name: lr8_A_seq_score12_ss1.7_rlbd_2ccv_IGNORE_THE_REST_DECOY_14637_3189_0 application version: 1.97 OS: Linux AdeB |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
A couple of strange, long-running WUs here. Both successfully completed and credit awarded, but both ran in excess of 8 hours with a 4 hour default runtime: lr5_score12_gb_run01_rlbd_1unp_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_14707_62_0 CPU time 29385.66 [...] lr5_score12_gb_run01_rlbd_1ig5_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_14707_38_0 CPU time 28843.29 [...] |
LizzieBarry Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
"A couple of strange, long-running WUs here. Both successfully completed and credit awarded, but both ran in excess of 8 hours with a 4 hour default runtime:" Exactly the same for me too. lr5_score12_gb_run01_rlbd_1ugh_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_14707_136_0 CPU time 28841.62 [...] |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
I'm seeing the same thing with lr5_score12_gb* workunits. This workunit, 282249916, ran for over seven hours on Mac OS X 10.5, eventually failing as follows: Fullatom mode .. Hbond tripped: [2009- 9-21 5:25:58:] BOINC:: CPU time: 25274.8s, 14400s + 10800s [2009- 9-21 12:55:41:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) For most of that time it was stuck on initialising (Model 0 Step 0). Here's the output of the Sampler while it was doing that: Sampling process 21361 for 3 seconds with 1 millisecond of run time between samples Sampling completed, processing symbols... Analysis of sampling minirosetta_1.97_i686-apple-darwin (pid 21361) every 1 millisecond Call graph: 2116 Thread_2507 2116 start 2116 _start 2116 main 2116 protocols::relax::Relax_main(bool) 2116 protocols::jd2::BOINCJobDistributor::go(utility::pointer::owning_ptr<protocols::moves::Mover>) 2116 protocols::jd2::JobDistributor::go(utility::pointer::owning_ptr<protocols::moves::Mover>) 2116 protocols::jd2::JobDistributor::go_main(utility::pointer::owning_ptr<protocols::moves::Mover>) 2116 protocols::relax::SimpleMultiRelax::apply(core::pose::Pose&) 2116 protocols::relax::ClassicRelax::apply(core::pose::Pose&) 2116 protocols::moves::RampingMover::apply(core::pose::Pose&) 2116 protocols::moves::TrialMover::apply(core::pose::Pose&) 2116 protocols::moves::JumpOutMover::apply(core::pose::Pose&) 2116 protocols::moves::MinMover::apply(core::pose::Pose&) 2116 core::optimization::AtomTreeMinimizer::run(core::pose::Pose&, core::kinematics::MoveMap const&, core::scoring::ScoreFunction const&, core::optimization::MinimizerOptions const&) const 2116 core::optimization::Minimizer::run(utility::vector1<double, std::allocator<double> >&) 2116 core::optimization::Minimizer::dfpmin_armijo(utility::vector1<double, std::allocator<double> >&, double&, core::optimization::ConvergenceTest&, bool) const 2116 
core::optimization::ArmijoLineMinimization::operator()(utility::vector1<double, std::allocator<double> >&, utility::vector1<double, std::allocator<double> >&) 2116 core::optimization::AtomTreeMultifunc::operator()(utility::vector1<double, std::allocator<double> > const&) const 2116 core::scoring::ScoreFunction::operator()(core::pose::Pose&) const 2116 core::scoring::ScoreFunction::eval_onebody_energies(core::pose::Pose&) const 2116 core::scoring::methods::OmegaTetherEnergy::residue_energy(core::conformation::Residue const&, core::scoring::EMapVector&) const 2113 core::scoring::OmegaTether::eval_omega_score_residue(core::conformation::Residue const&, double&, double&) const 2113 core::scoring::OmegaTether::eval_omega_score_residue(core::conformation::Residue const&, double&, double&) const 3 0xffffffff 3 _sigtramp 3 _sigtramp 2116 Thread_2603 2116 thread_start 2116 _pthread_start 2116 timer_thread(void*) 2116 boinc_sleep(double) 2116 usleep 2116 nanosleep 2116 mach_wait_until 2116 mach_wait_until 2116 Thread_2703 2116 thread_start 2116 _pthread_start 2116 protocols::boinc::watchdog::main_watchdog(void*) 2116 sleep 2116 nanosleep 2116 mach_wait_until 2116 mach_wait_until Total number in stack (recursive counted multiple, when >=5): Sort by top of stack, same collapsed (when >= 5): mach_wait_until 4232 core::scoring::OmegaTether::eval_omega_score_residue(core::conformation::Residue const&, double&, double&) const 2113 Sample analysis of process 21361 written to file /dev/stdout |
CraniuMod Joined: 11 Jan 08 Posts: 3 Credit: 565,798 RAC: 0 |
Will keep an eye out for this from here on out. Has not happened on Rosetta before. Task 281082606; Name: 1fv5A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1fv5A-_14711_576_0; Workunit: 256342237; Created: 15 Sep 2009 20:41:03 UTC; Sent: 15 Sep 2009 20:53:55 UTC; Received: 23 Sep 2009 20:09:18 UTC; Server state: Over; Outcome: Client error; Client state: Aborted by user; Exit status: -197 (0xffffff3b); Computer ID: 926185; Report deadline: 25 Sep 2009 20:53:55 UTC; CPU time: 19541.34; stderr out: <core_client_version>6.6.36</core_client_version> <![CDATA[ <message> aborted by user </message> ]]>; Validate state: Invalid; Claimed credit: 78.92439165502; Granted credit: 0; application version: 1.97 |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"Will keep an eye out for this from hereon out. Has not happened on Rosetta before." Did you abort this task, or what happened? 5.5 hrs is not really a long run. |
CraniuMod Joined: 11 Jan 08 Posts: 3 Credit: 565,798 RAC: 0 |
"Will keep an eye out for this from hereon out. Has not happened on Rosetta before." The client was reporting this as running for 38 hrs. I did abort. I went back to the client log and found the following: 9/23/2009 4:07:32 PM rosetta@home task 1fv5A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1fv5A-_14711_576_0 resumed by user 9/23/2009 4:07:33 PM rosetta@home Restarting task 1fv5A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1fv5A-_14711_576_0 using minirosetta version 197 9/23/2009 4:08:15 PM rosetta@home Task 1fv5A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1fv5A-_14711_576_0 exited with zero status but no 'finished' file 9/23/2009 4:08:15 PM rosetta@home If this happens repeatedly you may need to reset the project. 9/23/2009 4:08:15 PM rosetta@home Restarting task 1fv5A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1fv5A-_14711_576_0 using minirosetta version 197 9/23/2009 4:08:23 PM rosetta@home task 1fv5A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1fv5A-_14711_576_0 aborted by user 9/23/2009 4:08:24 PM World Community Grid Resuming task faah8210_ZINC04849622_xmdEq_2R5P1c_01_0 using faah version 607 9/23/2009 4:08:38 PM rosetta@home update requested by user 9/23/2009 4:08:44 PM rosetta@home Sending scheduler request: Requested by user. 
9/23/2009 4:08:44 PM rosetta@home Reporting 2 completed tasks, not requesting new tasks 9/23/2009 4:08:48 PM rosetta@home Scheduler request completed 9/23/2009 4:08:48 PM rosetta@home [error] garbage_collect(); still have active task for acked result 1fv5A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1fv5A-_14711_576_0; state 5 9/23/2009 4:08:49 PM rosetta@home Computation for task 1fv5A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1fv5A-_14711_576_0 finished 9/23/2009 4:08:49 PM rosetta@home Output file 1fv5A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1fv5A-_14711_576_0_0 for task 1fv5A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1fv5A-_14711_576_0 absent |
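
For anyone wanting to check a task against the rule of thumb Mod.Sense gives above (abort if actual CPU time is more than the runtime preference plus about 4 hours), here is a minimal Python sketch. The function names are made up for illustration and are not part of BOINC or Rosetta. The grace period is left as a parameter, on the assumption that the second number in log lines like "14400s + 10800s" and "14400s + 43200s" quoted in this thread is the watchdog allowance on top of the preference; that reading is a guess, not something confirmed by the project.

```python
# Hypothetical helpers, not part of BOINC or Rosetta.
# Assumption: the watchdog limit is "runtime preference + grace period",
# with a 4-hour grace period as described in Mod.Sense's post.
DEFAULT_GRACE_S = 4 * 3600


def watchdog_limit(runtime_pref_s, grace_s=DEFAULT_GRACE_S):
    """CPU seconds after which the watchdog would be expected to end the task."""
    return runtime_pref_s + grace_s


def is_overdue(cpu_time_s, runtime_pref_s, grace_s=DEFAULT_GRACE_S):
    """True if the task has used more CPU time than preference + grace."""
    return cpu_time_s > watchdog_limit(runtime_pref_s, grace_s)


def hours(seconds):
    """Convert CPU seconds, as shown in task properties, to hours."""
    return seconds / 3600.0


if __name__ == "__main__":
    # 3 hr preference with a 4 hr grace period gives the 7 hr figure from the post.
    print(hours(watchdog_limit(3 * 3600)))
    # 23518.3 CPU seconds against a 14400 s preference, as in one report above.
    print(round(hours(23518.3), 2), is_overdue(23518.3, 14400))
    # "CPU time: 25274.8s, 14400s + 10800s", reading 10800 as the grace period.
    print(is_overdue(25274.8, 14400, grace_s=10800))
```

Take the CPU time from the task's Properties dialog rather than the elapsed wall-clock time, since the check in the post is stated in terms of actual CPU time.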