Message boards : Number crunching : Report "hombench_..." issues here!
Author | Message |
---|---|
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Report WU issues for the hombench_XXX WUs here please! We'd like to keep the science-related thread on topic (i.e. about the science of this project), but we'd love to hear your feedback on the actual WUs too, so please post it here. I've already had some info come back on spurious WUs that take waaay too long. We're looking into it now and will hopefully resolve that soon! Thanks, Mike Tyka Rosetta Moderator: Mod.Sense |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 23 |
I have one of these hombench_ units now, this one. I have 3 hours set as my length, but this thing has been crunching for approaching 6 hours now. I opened the graphics, (something I rarely do), and saw that the structure it was currently "accepting" was a long way, (not kilometers, light years!), from the native. The time to completion is 00:09:51, but then it has been that for the last 2-3 hours... Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 23 |
Something seriously wrong there. This machine typically claims and gets 50 - 60 for a 3 hour wu, this wu was over 7 hours and claimed 148 and was granted 20!!!!! Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
R.L. Casey Joined: 7 Jun 06 Posts: 91 Credit: 2,728,885 RAC: 0 |
"Something seriously wrong there. This machine typically claims and gets 50 - 60 for a 3 hour wu, this wu was over 7 hours and claimed 148 and was granted 20!!!!!" Hi adrianxw, I can't help but wonder whether the "large" size of the model requires much more cache memory to run efficiently. The computer in question shows only 244 kilobytes of cache (per CPU, I assume) and returned only 2 decoys. I wish that I knew more that might be helpful. Good luck, and keep on crunching! |
Hubington Joined: 3 Feb 06 Posts: 24 Credit: 127,236 RAC: 0 |
"Something seriously wrong there. This machine typically claims and gets 50 - 60 for a 3 hour wu, this wu was over 7 hours and claimed 148 and was granted 20!!!!!" I had a similar problem with minirosetta 1.34 on hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t328___4598_724: it took 9 hours 25 mins to complete, showing 9 mins 52 seconds left for at least 3 hours of that. I'd quite like to get into the top 100,000 contributors, so I decided to check the credit for this: claimed credit was 127, granted was 22, when I normally get around 35-40 for a 3 hour packet. At the end of the day I don't really care about the credits, it's just a nice little motivating factor, but I'm sure there are people who do and who will start to abort these units so as not to lose credits, especially if their system doesn't kick out that much each day to start with. |
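To put the claimed and granted figures reported above on a common scale, credit per CPU hour can be compared directly. The following is a minimal Python sketch using only numbers posted in this thread; the function name `credit_per_hour` is hypothetical, midpoints are assumed where a poster gave a range, and this is plain arithmetic rather than the project's actual credit calculation.

```python
def credit_per_hour(credit: float, hours: float) -> float:
    """Return credit earned per CPU hour (simple ratio, nothing project-specific)."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return credit / hours


# Figures as posted in the thread. Midpoints are an assumption where a range was given.
normal_adrianxw = credit_per_hour(55, 3)            # "50 - 60 for a 3 hour wu"
slow_adrianxw = credit_per_hour(20, 7)              # granted 20 after "over 7 hours"
normal_hubington = credit_per_hour(37.5, 3)         # "35-40 for a 3 hour packet"
slow_hubington = credit_per_hour(22, 9 + 25 / 60)   # granted 22 after 9 hours 25 mins

for label, rate in [
    ("adrianxw, typical WU", normal_adrianxw),
    ("adrianxw, slow hombench WU", slow_adrianxw),
    ("Hubington, typical WU", normal_hubington),
    ("Hubington, slow hombench WU", slow_hubington),
]:
    print(f"{label}: {rate:.1f} credits/hour")
```

On these assumed inputs, both slow tasks come out at well under a quarter of each machine's usual hourly rate, which matches the complaint in the posts.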
Nils Joined: 27 Feb 08 Posts: 1 Credit: 7,593 RAC: 0 |
I had the same problem as adrianxw. After at least 20 hours of crunching (usually 3 or 4 hours) I aborted this WU, hombench_mtyka_foldcst_loopbuild_boinctest3_foldcst_loopbuild_t326__IGNORE_THE_REST_2A9VA_9_5040_1, after I noticed that it had been stuck at 98% for hours. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 23 |
"The computer in question shows only 244 kilobytes of cache (per CPU, I assume) and returned only 2 decoys." The machine has a Q6600 processor, i.e. 2MB of cache per CPU. I maintain that this, and other wu's of this type, are screwed up somehow, but nobody on the project seems to want to run with this particular ball... as is becoming the norm. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
R.L. Casey Joined: 7 Jun 06 Posts: 91 Credit: 2,728,885 RAC: 0 |
"The computer in question shows only 244 kilobytes of cache (per CPU, I assume) and returned only 2 decoys." Hi adrianxw, In looking at your WU and the one reported by Hubington, they are both of the form 'hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t328___4598_nnn', where 'nnn' changes. It does appear that for this 't328' case, the code is getting wrapped around the axle for a Q6600 core to run around 13,000 CPU seconds per decoy! This will no doubt be interesting to Mike Tyka, since the whole point of his new code (only operating since Sept. 21) is to benchmark various methods. Your WU could be invaluable to him to eliminate some strategy that breaks down in some cases. This is research, and I am reminded of the saying "If we (really) knew what we were doing, it wouldn't be research."! The good news in this is that three other 'hombench...' WUs you crunched did work in a 'nominal' way, giving many more decoys and being awarded your more average amount of credit per CPU time. Hopefully, Mike T. will be able to look into this case soon, if he hasn't already. Until then, know that you are contributing greatly! Thanks for crunching Rosetta! |
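The observation above, that the slow tasks share a target token such as 't328', can be checked mechanically by pulling that token out of each reported task name. This is a minimal Python sketch; the regular expression and the helper names `target_of` and `group_by_target` are assumptions made for illustration, based only on the task names posted in this thread, not on any documented naming scheme.

```python
import re
from collections import defaultdict

# Assumption: the target is a "t" followed by digits, delimited by underscores.
TARGET_RE = re.compile(r"_(t\d+)_")


def target_of(task_name):
    """Return the target token (e.g. 't328') or None if the name has none."""
    match = TARGET_RE.search(task_name)
    return match.group(1) if match else None


def group_by_target(task_names):
    """Group task names by their target token."""
    groups = defaultdict(list)
    for name in task_names:
        groups[target_of(name)].append(name)
    return dict(groups)


# Task names exactly as reported earlier in the thread.
reported = [
    "hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t328___4598_724",
    "hombench_mtyka_foldcst_loopbuild_boinctest3_foldcst_loopbuild_t326__IGNORE_THE_REST_2A9VA_9_5040_1",
]

for target, names in group_by_target(reported).items():
    print(target, len(names))
```

Grouping reports this way makes it easier to see whether long run times cluster on particular proteins rather than on particular machines.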
Dalton Joined: 30 Nov 05 Posts: 2 Credit: 27,777,725 RAC: 0 |
I too had the same problem as adrian & Nils. After at least 52 hours of crunching I gave up on this WU, hombench_mtyka_foldcst_loopbuild_boinctest3_foldcst_loopbuild_t313__IGNORE_THE_REST_1D9XA_4_4571_5_0, after I noticed that it had been stuck at 98% for hours. After 98% it would get maybe 0.0001% done a day, whereas a normal WU takes 3-4 hours on a T7300. I'm now noticing in the messages that this WU keeps restarting. |
Barraud Denis Joined: 8 May 06 Posts: 6 Credit: 1,258,677 RAC: 0 |
hombench_mtyka_foldcst_loopbuild_boinctest3_foldcst_loopbuild_t328__IGNORE_THE_REST_1ET0A_11_4578_8_0 using minirosetta version 134, Q6600, XP 32-bit, 2×2 GB DDR2-8500 -> 21:52:20 at 99.243%, 00:09:56 remaining. On my BOINC 6.3.10 I had to suspend the WU to check that it is not defective. Apparently it is not, but I am waiting to see whether it resumes. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Barraud is running BOINC 6.3.10 and had a hombench task that ran nearly 22hrs and still shows about 10 minutes remaining. He suspended it to see if it would complete or continue running. Sounds like BOINC is still running other tasks and hasn't come back to it. From looking at other tasks of Barraud's it seems he has a 3 hour runtime preference. Thanks, Barraud. I'd say stop it. That is too much time. Rosetta Moderator: Mod.Sense |
Hubington Joined: 3 Feb 06 Posts: 24 Credit: 127,236 RAC: 0 |
New one on the way, minirosetta 1.34: hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t286___4580_1561_0 has currently been running for 36 hours & 5 mins! 99.540% complete. OK, I just noticed something VERY worrying while trying to see how long it took to click over 0.001%: the run time jumped back 6 mins?!?!?! and now it has lost 0.001% from the progress, taking it back to 99.539. Running on AMD dual cores at 2.41GHz (4800+ combined), if that makes a difference; 64-bit chip with a 32-bit OS. When it is making progress it looks as though it's taking 5 mins to get 0.001%, but if the CPU run time is constantly jumping back as I observed it do, then who can say what the run time really is! |
Hubington Joined: 3 Feb 06 Posts: 24 Credit: 127,236 RAC: 0 |
Finished at 39 hours 35 mins credit claimed: -- credit granted: -- outcome: Validate error |
Barraud Denis Joined: 8 May 06 Posts: 6 Credit: 1,258,677 RAC: 0 |
Follow-up to Message 56174 - Posted 2 Oct 2008 20:21:07 UTC: hombench_mtyka_foldcst_loopbuild_boinctest3_foldcst_loopbuild_t328__IGNORE_THE_REST_1ET0A_11_4578_8_0 using minirosetta version 134, Q6600, XP 32-bit, 2×2 GB DDR2-8500. On my BOINC 6.3.10 I had to suspend the WU to check that it is not defective. Apparently it is not, but I am waiting to see whether it resumes. Now, more information about my BOINC / Rosetta parameters. My BOINC preferences: switch between applications every 80 minutes; BOINC -> project -> resource share ('partages des resources'): 6.25%. My Rosetta preferences: Target CPU run time: 4 hours. The WU restarts when its turn comes in BOINC! It always runs again with: now: 26:57:00 at 99.385%, 00:09:53 remaining; 26:47:00 at 99.380%, 00:09:56 remaining; before: 21:52:20 at 99.243%, 00:09:56 remaining. The only thing I have changed is the OS priority, from Base to Normal in XP's Task Manager, to see if it changes anything for the WU. More to follow. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Barraud, with your preferences, that task should have been ended by the watchdog because it has been running for more than 5 times your normal runtime preference. Please abort it. Rosetta Moderator: Mod.Sense |
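The rule described above (end a task once it has run more than 5 times the runtime preference) can be sketched as a simple comparison. This is a hedged Python illustration only: the helper names `hms_to_hours` and `exceeds_watchdog_limit` are hypothetical, the factor of 5 comes from the moderator's post, and the real watchdog's logic is not shown anywhere in this thread.

```python
def hms_to_hours(hms: str) -> float:
    """Convert an 'HH:MM:SS' elapsed-time string to hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


def exceeds_watchdog_limit(elapsed_hours: float, target_hours: float,
                           factor: float = 5.0) -> bool:
    """True if elapsed time is more than `factor` times the runtime preference.

    The default factor of 5 is taken from the moderator's description; it is
    an assumption that the real watchdog applies it exactly this way.
    """
    return elapsed_hours > factor * target_hours


# Barraud's reported state: 26:57:00 elapsed with a 4 hour target CPU run time.
elapsed = hms_to_hours("26:57:00")
print(exceeds_watchdog_limit(elapsed, target_hours=4))
```

With a 4 hour preference the cutoff would be 20 hours, so a task at nearly 27 hours is past it, which is why the moderator expected the watchdog to have stepped in already.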
Barraud Denis Joined: 8 May 06 Posts: 6 Credit: 1,258,677 RAC: 0 |
"Barraud, with your preferences, that task should have been ended by the watchdog because it has been running for more than 5 times your normal runtime preference." For the moment I prefer to let it run; I notice the WU's deadline is 6/10/2008 21:40. But I want to know whether you will need some of the WU's files for analysis. I could zip the task slot directory and send it to you later if you need it. INFORMATION: in BOINC -> Messages, it seemed the WU restarts around every 15 minutes? Confirmed: according to Task Manager and the BOINC messages, the WU restarts every 15 minutes... it restarts in a loop at the same point every 15 min?? Now: 27:03:30 - 99.387% - 00:09:56. I will abort the WU later; given the time I have already spent crunching it, I can let it run a little longer before aborting it. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"I could zip the task slot directory and send you later if you need it." No, all that should be needed is the task name, and the description of what seems abnormal about it, which you have already provided. Thanks. Rosetta Moderator: Mod.Sense |
Conan Joined: 11 Oct 05 Posts: 150 Credit: 4,236,942 RAC: 3,767 |
"I could zip the task slot directory and send you later if you need it." G'Day Mod.Sense, Can you let the powers that be know that this "hombench" thing is still occurring with the latest 1.36 Ralph work units as well. I have just reported on the Ralph forum the same problem that adrianxw has already reported. My preferences are set to 6 hours, but I had 3 WUs go to 11 hours and one past 8 hours; the 11 hour WUs then reset themselves to zero and started again. I did not want to waste another day on 4 WUs, so I aborted them after another half an hour. They all showed 9 minutes 57 seconds still to go for over 4 hours. I might abort the lot of them yet if all I am going to get for 11 plus hours of processing is 20 points. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Thanks Conan. From what I'm seeing, within the hombench, it depends greatly on which protein the task is for. If you can avoid aborting them, I would let them run unless/until they take longer than the roughly 2hrs per model mentioned in the long-running models thread. Rosetta Moderator: Mod.Sense |
Mike Tyka Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
I've tracked down some problems with the hombench_ WUs, so the next batches going out soon (in preparation now) should take considerably less time, certainly less than 2-3 hours per model, more likely (depending on the size of the proteins) less than half an hour. The reasons for the long WUs were to do with the size of the proteins, which is why the problem was much worse for some of the guys than others. Some of the proteins in the hombench WUs are larger than the usual stuff we had run on BOINC before. The refinement stage of the code was using an older algorithm that turned out to scale poorly with protein size. I've replaced that part with an almost as effective, but much, much more efficient algorithm. Thanks for alerting us to this problem. For some of the smaller sized WUs I've sent out after noticing (e.g. hombench_mtyka_looprelax_ccd_moves_2_looprelax_ccd_moves_t302_) I'm seeing as much as 10 models / hr now! While the larger proteins (e.g. _t293 that was previously causing trouble) are now down to an acceptable 2 hours per model. This is exciting! There'll be a bunch of stuff going out soon. Once we've got some preliminary results we'll display them in the science thread. Mike Mike http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
©2024 University of Washington
https://www.bakerlab.org