Message boards : Number crunching : Why so much variation?
Chris Holvenstot (Joined: 2 May 10, Posts: 220, Credit: 9,106,918, RAC: 0)
I was wondering why there is so much variation in the amount of credit granted per "CPU second" - cut and pasted below is a snippet of the task log from one of my systems:

| Task ID | Workunit | Sent | Received | Server state | Outcome | Client state | CPU time (s) | Claimed credit | Granted credit |
|---|---|---|---|---|---|---|---|---|---|
| 342017570 | 312236633 | 28 May 2010 23:44:20 UTC | 29 May 2010 14:25:59 UTC | Over | Success | Done | 20,454.03 | 136.45 | 142.50 |
| 341972687 | 312198152 | 28 May 2010 19:06:34 UTC | 29 May 2010 13:42:55 UTC | Over | Success | Done | 21,496.80 | 143.41 | 140.66 |
| 341951763 | 312179864 | 28 May 2010 17:08:08 UTC | 29 May 2010 15:34:28 UTC | Over | Success | Done | 36,419.22 | 242.96 | 131.91 |
| 341942478 | 312171727 | 28 May 2010 16:14:04 UTC | 29 May 2010 12:22:14 UTC | Over | Success | Done | 21,556.83 | 143.81 | 183.73 |

All of these tasks completed successfully, and all were run on the same machine, pretty much concurrently. Note that the third task in the list ran for 36,419 seconds, or about 10 hours. My current target run time is set to 6 hours, so I guess the "watchdog" got it.

Most tasks on this system run about 6 hours and return about 140 credits, give or take, which works out to roughly 23 credits per core for an hour of work. The task in question returned 132 credits, or about 13 credits per hour.

I understand that differences between systems cause slight variations in the amount of credit given, but these were all run on the same system. My concern here is not so much the credit, but the possibility that there is some bug causing the occasional task to consume excess CPU cycles while not producing any useful work.

I am running a "hybrid" Linux system which sits in the middle ground between Debian and Ubuntu, with version 6.10.56 of the BOINC software installed. Any insight would be appreciated.
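(For reference, the credits-per-hour figures above come from straightforward arithmetic on the CPU time and granted credit columns; here is a minimal sketch using only the values pasted in the table.)

```python
# Back-of-the-envelope check: credits granted per CPU hour for the four tasks
# pasted above (task ID, CPU seconds, granted credit).
tasks = [
    ("342017570", 20454.03, 142.50),
    ("341972687", 21496.80, 140.66),
    ("341951763", 36419.22, 131.91),  # the ~10-hour outlier
    ("341942478", 21556.83, 183.73),
]

for task_id, cpu_seconds, granted in tasks:
    print(f"{task_id}: {granted / (cpu_seconds / 3600):.1f} credits per CPU hour")
# Roughly 25, 24, 13 and 31 credits/hour -- the long task earns about half the usual rate.
```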
Murasaki (Joined: 20 Apr 06, Posts: 303, Credit: 511,418, RAC: 0)
The first post in "The new credit system explained" should answer most of your questions. If there is anything it doesn't answer, feel free to ask again here.
Chris Holvenstot (Joined: 2 May 10, Posts: 220, Credit: 9,106,918, RAC: 0)
First, thanks so much for the reply and for not taking the "read the $%#@$ FAQ, you moron" approach.

I reviewed the post you directed me to (again), and I guess all I can say is that if there really is an expectation that the level of effort to work a model will vary that much, then maybe they should renew the discussion on the "first xxx" results started back in 2006. However, the way they "temper" subsequent results really is very clever. I don't know.
Chris Holvenstot (Joined: 2 May 10, Posts: 220, Credit: 9,106,918, RAC: 0)
After watching several more of these long running tasks float through my systems today I really have to believe that there is something seriously wrong.

It is my understanding that Rosetta does not utilize GPU processing, so the odds are that the vast majority of work units are processed on x86 silicon – Intel or AMD. So while there will be differences driven by the diversity of operating systems and specific chips in use, I think that the differences I am seeing are far outside the bounds of what can be expected.

For example, I just had a job running on a dedicated AMD Phenom II 925 – standard clocking and 4 gig of memory. This job ran over 10 hours, claimed about 227 credits, and was granted just 32 credits. This nets me a little over 3 credits per CPU hour. On jobs which terminate "normally" at the 6-hour point I currently have set in my preferences, I seem to net about 20 to 25 credits per CPU hour. That is a factor of about 7 – far greater than what I would expect due to the differences between the various x86 systems. And the opposing factors in the averaging algorithm necessary to knock this task from 227 down to 32 credits boggle the mind.

Which raises the question – what function is being utilized in these jobs which is so horribly inefficient on Linux or AMD systems? Once again, this is not about the "credit race"; rather, it seems to be a sign that something is very wrong in either AMD land or Linux world.

Over the past week I have been increasing my desired run time towards the requested 8 hours, but I am having second thoughts about going further in that direction until I understand what is going on. I see no reason to extend the amount of time I potentially have a core "spinning its wheels" while getting very little productive work done. This phenomenon seems to happen on about 5 to 10 percent of the work units I process.

I have pasted the output of the job I used as an example below. Once again, I do appreciate the time you take responding to this note and sharing your knowledge with me.

Task ID: 342079197
Name: int2_centerfirst2b_1fAc_2rb8_ProteinInterfaceDesign_23May2010_21231_6_0
Workunit: 312290430
Created: 29 May 2010 6:33:44 UTC
Sent: 29 May 2010 6:34:42 UTC
Received: 30 May 2010 0:43:09 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x0)
Computer ID: 1281342
Report deadline: 8 Jun 2010 6:34:42 UTC
CPU time: 36501.92

stderr out:

<core_client_version>6.10.56</core_client_version>
<![CDATA[
<stderr_txt>
[2010- 5-29 8:55:17:] :: BOINC:: Initializing ... ok.
[2010- 5-29 8:55:17:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 21600
BOINC:: CPU time: 36499.7s, 14400s + 21600s
[2010- 5-29 19:37:32:] :: BOINC
InternalDecoyCount: 156
======================================================
DONE :: 2 starting structures 36499.7 cpu seconds
This process generated 156 decoys from 156 attempts
======================================================
called boinc_finish
</stderr_txt>
]]>
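(Side note: the "14400s + 21600s" line in that stderr suggests the watchdog cutoff is the runtime preference plus a fixed four-hour grace period. The sketch below just restates that arithmetic; it is an inference from this one log, not confirmed project behaviour.)

```python
# Inferred from the "BOINC:: CPU time: 36499.7s, 14400s + 21600s" line above:
# the watchdog appears to allow the runtime preference plus a ~4-hour grace.
# This is an assumption drawn from a single log, not documented behaviour.
cpu_run_time_pref = 21600   # 6-hour target runtime from the preferences
watchdog_grace = 14400      # apparent fixed 4-hour allowance

cutoff = cpu_run_time_pref + watchdog_grace
print(f"apparent watchdog cutoff: {cutoff} s ({cutoff / 3600:.0f} h)")
# -> 36000 s (10 h), consistent with the ~36,500 s CPU time reported above
```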
Murasaki (Joined: 20 Apr 06, Posts: 303, Credit: 511,418, RAC: 0)
You might want to post the details in the "Report long-running models here" thread. Don't worry about asking questions; BOINC is so complicated that I doubt any one person knows everything about it and its numerous projects.
Chris Holvenstot (Joined: 2 May 10, Posts: 220, Credit: 9,106,918, RAC: 0)
Thanks to both of you for your responses. After reviewing the first post in the "Long Running Models" thread again, I believe that you are correct. I will try to gather up the information requested and post it there.

If this is the case, then there seems to have been a big upturn in the number of these models over the past few days - or maybe I just got lucky. In any case, I keep my queue short to try to provide the desired quick turnaround, so if they track down the problem I will be quick to feel the change.
Bikermatt (Joined: 12 Feb 10, Posts: 20, Credit: 10,552,445, RAC: 0)
Chris,

You are not alone. I have seen a huge increase in the number of long-running work units on all of my systems for the last week or two, and especially the last few days. I have AMD and Intel CPUs and I am running both Linux and Win7, and this occurs on all of the systems.

I first noticed it in the ProteinInterfaceDesign work units and I reported it in the long-running thread; however, today I noticed a gunn_fragments task that had run for 10 hours with 2 hours of CPU time, so this may be a 2.14 issue.

Usually if I suspend and then resume the task it will start running again or complete itself within a few seconds.

Matt
LizzieBarry (Joined: 25 Feb 08, Posts: 76, Credit: 201,862, RAC: 0)
After watching several more of these long running tasks float through my systems today I really have to believe that there is something seriously wrong...

I agree, Chris. I'm not obsessive about credits at all, but it does seem to me this isn't working right. With the watchdog kicking in after 3 or 4 hours of over-run, and assuming there weren't any long-running models among the initial decoys, I can't see how a watchdog-truncated model would do more than halve the credits on a 3-4 hour runtime, or cut them by a third on an 8-hour runtime. I'm seeing much the same as you with my WUs.
Chris Holvenstot (Joined: 2 May 10, Posts: 220, Credit: 9,106,918, RAC: 0)
Matt - to put it in the words of the typical Texas redneck: "Bubba, you be a genuine gene-e-us."

OK - here is what I saw, with the five "long running" tasks which were executing on my systems at the time I read your post:

1 - When I looked at a "long running" task's properties I could see that the last checkpoint was several hours in the past. This was consistent with each and every one of the "long running" tasks I was able to observe.

2 - A "normal" task seems to take a checkpoint every few minutes. I am new to this so I am not sure what the exact cycle is, but the age of the checkpoint seems to be key in identifying tasks which are in trouble.

3 - All five of the "long running" tasks, whose last checkpoint age was greater than an hour or two, were members of the "ProteinInterfaceDesign" family. With such a small sample size I am not sure if that was the luck of the draw or a smoking gun.

4 - After being suspended and then resumed as you suggested, they all completed in short order. Interestingly, they were also granted credit pretty much in line with what would normally be expected.

Now what I suspect, and this is pure speculation with no hard facts to back it up, is that the act of resuming a task generates a checkpoint. When the task starts executing again it determines that it does not have enough time to complete a new model before its run time is up, and it calls it a day.

I will try to observe the last checkpoint time on the next "long running" task I see. Thanks again for the suggestion.
LizzieBarry (Joined: 25 Feb 08, Posts: 76, Credit: 201,862, RAC: 0)
I tried the suspend option on another long-running model earlier today and it didn't end the task shortly afterwards. It carried on until the watchdog kicked in - same as the others. It's certainly worth a try, though.

I agree it depends on where the last checkpoint came: if there hasn't been one since the problem decoy began, it seems to me the task will go back to the last checkpoint - and, more significantly, the CPU time of that checkpoint - and make the same decision to continue as it did before. Wall clock time doesn't matter, as I understand things.

Suspend, unsuspend, hope! ;)
Chris Holvenstot (Joined: 2 May 10, Posts: 220, Credit: 9,106,918, RAC: 0)
Yes, Ms Lizzie - I fear you are right. The suspend/resume thing works once in a while, but nowhere near as often as I wish it would.

I have watched these aberrant tasks squeeze through my system most of the day, and the only thing that seems consistent - at least on my systems - is that they have all been members of the ProteinInterfaceDesign group. I have seen about a dozen of them today - one took its last checkpoint as early as thirteen minutes into the run, and others were well past the five-hour point when they went south.

Now the question I have is: once these work units have gone astray - as denoted by a really stale checkpoint time stamp - are they accomplishing any valid work, or should they be aborted when found? So far they have burned a prodigious amount of CPU cycles, but I have erred on the safe side and allowed them to continue (after a suspend/resume cycle, that is).
Bikermatt (Joined: 12 Feb 10, Posts: 20, Credit: 10,552,445, RAC: 0)
I noticed that my laptop, which has a three-hour run time, seems to have less trouble completing these work units. My higher-throughput systems are set to six hours. Also, it seems most of the stuck CPU times are in the two-to-four-hour range. I am going to move one system back to a two-hour run time and see what happens.

Could this be a watchdog issue in combination with some work units that don't crash until later in the run?

Definitely not a genius! I just had a rod put in my leg and have a ruptured disc in my back! I am spending way too much time looking at these computers.

Matt
Chris Holvenstot (Joined: 2 May 10, Posts: 220, Credit: 9,106,918, RAC: 0)
Does anyone know exactly what event triggers a checkpoint? Is it time-based, or does it occur after the creation of a decoy?
Mod.Sense, Volunteer moderator (Joined: 22 Aug 06, Posts: 4018, Credit: 0, RAC: 0)
I'll have to catch up on the rest of this thread later, but to take a checkpoint the program must reach key points in the execution where the application has had the necessary code added to do so, along with corresponding code to resume from any given checkpoint. End of model is always a checkpoint (except perhaps with these newest protocols where thousands of models are produced; I'm not positive about those). Otherwise, they had to basically carve out specific points in the run where they were able to implement checkpoints. The frequency of these points varies by the type of task you are working on, the protein being studied, and the protocols being used to study it. The goal is to be taking checkpoints every 10-15 minutes, but not all combinations are there yet.

I believe code is in place so that if you happen to hit a portion of code where a checkpoint is possible, and a checkpoint has "just" been taken (within your disk "write at most" setting in the BOINC configuration), it will not actually write that checkpoint to disk - at least not until after the configured duration has passed, and by that time you've probably hit another checkpoint as well, so it all gets written at once. So if your machine is cranking through the work at record speed, the idea is you still won't burn up your disk drive with IO due to Rosetta.

You are probably more concerned about long spans between checkpoints, and as mentioned above, you may be running a lucky long-running model in a protocol that is unable to checkpoint frequently.

Long story short, I'm trying to say that checkpoints can be related to time, and also to the specifics of the task. But either way, it isn't going to hit dead on at, say, a 5-minute interval or whatever. See my recent analogy elsewhere to how that's like walking into a bakery and demanding all baking be stopped and preserved immediately, and then expecting to resume all baking operations and still have edible breads.

Rosetta Moderator: Mod.Sense
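(To illustrate the "write at most" throttling described above, here is a minimal sketch of that logic; the function and variable names are hypothetical, not actual Rosetta or BOINC source code.)

```python
import time

# Hypothetical sketch of the checkpoint throttling described above: even when
# the run reaches a point where a checkpoint is possible, the write is skipped
# if one was taken within the "write to disk at most every N seconds" setting.
DISK_WRITE_INTERVAL = 60.0   # the BOINC "write at most" preference, in seconds
_last_write = 0.0            # wall-clock time of the last checkpoint actually written

def maybe_checkpoint(write_state_to_disk):
    """Call only at the points in the run where a checkpoint is possible."""
    global _last_write
    now = time.time()
    if now - _last_write < DISK_WRITE_INTERVAL:
        return False          # skip the disk IO; a later opportunity will catch it
    write_state_to_disk()     # persist enough state to resume from this point
    _last_write = now
    return True
```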
Mod.Sense, Volunteer moderator (Joined: 22 Aug 06, Posts: 4018, Credit: 0, RAC: 0)
Yes, the protein interface protocol is one of the more recent additions to Rosetta. It is performing some very challenging work, so it hasn't matured enough to eliminate more of the long running models, and it may not checkpoint beyond end of model. So I believe that what you are observing is explained by that, and by the fact that this type of protein study has been a more common task recently in the mix of work being processed.

As to the suspend/resume theory... this will never trigger a checkpoint. Don't bother sending me a log attempting to prove otherwise; any such log would only show the power of coincidence. You wouldn't be able to "make" it happen again - you just happened to do it when a checkpoint would have been taken anyway. The suspend/resume thoughts are typically constructive for tasks that have "stalled", i.e. tasks that the BOINC Manager says are in a "running" state but are not actually consuming CPU time. That is a separate issue.

Rosetta Moderator: Mod.Sense
Sid Celery (Joined: 11 Feb 08, Posts: 2130, Credit: 41,424,155, RAC: 16,102)
Understood, Mod.Sense, but I observed the graphics on one ProteinInterfaceDesign WU and each model seemed to go through 500 steps before moving to the next model. But when it ran into extended time it seemed to do the same 500 steps, stop a while, then go through a further 500 steps, stop a while, and so on, over and over without moving to the next model. Only the watchdog seems to end this repetition.

Some WUs involve hundreds of models, some thousands, but they eventually seem to hit this glitch at some point before the run-time is up. I imagine that reducing the runtime also reduces the likelihood that the problem model will be reached, so it will help but not eradicate the issue.

I don't know if this helps, but perhaps it's a clue that strikes a chord with the backroom guys.
Paul D. Buck (Joined: 17 Sep 05, Posts: 815, Credit: 1,812,737, RAC: 0)
Here is a pair of tasks, one generated 2 decoys, the other 8, same run times ... but look at the difference in granted credit ... 0.69 vs. 53 ...

I don't know about you, but the methodology being used has issues ... the same amount of calculation time, and a factor of almost 100 difference in award? I mean, I could buy the explanation if the variance was a factor of 4, with one doing 2 decoys and the other 8 ... but sorry, the explanation does not hold water ... one of the reasons I had deprecated my involvement in Rosetta ... which, after I push it over 1M, I suspect I will do again ...

I grant you credits are worthless, but fairness is not ...
Sid Celery (Joined: 11 Feb 08, Posts: 2130, Credit: 41,424,155, RAC: 16,102)
Here is a pair of tasks, one generated 2 decoys, the other 8, same run times ... but look at the difference in granted credit ... 0.69 vs. 53 ...

In fairness, they aren't the same kind of tasks. A fairer comparison would be:

| Task | CPU time (s) | Decoys | Granted credit | Credit/decoy |
|---|---|---|---|---|
| int2_centerfirst2b_1fAc_2os0_ProteinInterfaceDesign_23May2010_21231_93_0 | 33,817.58 | 121 | 30.62 | 0.25 |
| int2_centerfirst2b_1fAc_2pxx_ProteinInterfaceDesign_23May2010_21231_6_1 | 43,486.92 (watchdog) | 24 | 10.75 | 0.45 |
| int2_centerfirst2b_1fAc_1y9q_ProteinInterfaceDesign_23May2010_21231_55_1 | 28,748.19 | 470 | 165.79 | 0.35 |

All 3 tasks were reported on the same day from the same machine. Oddly, the task with the lowest total credit and the longest run also had the highest credit/decoy.
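(Just to make the arithmetic explicit, the per-decoy and per-CPU-hour rates can be recomputed directly from the reported numbers; nothing project-specific is assumed here, and the labels are just shorthand for the task names in the table.)

```python
# Recompute credit/decoy and credit/CPU-hour from the three tasks in the table.
tasks = [
    ("2os0", 33817.58, 121, 30.62),
    ("2pxx", 43486.92,  24, 10.75),   # watchdog-terminated
    ("1y9q", 28748.19, 470, 165.79),
]

for label, cpu_s, decoys, credit in tasks:
    print(f"{label}: {credit / decoys:.2f} credits/decoy, "
          f"{credit / (cpu_s / 3600):.1f} credits/CPU-hour")
# -> about 0.25, 0.45 and 0.35 credits/decoy, but only ~3.3, ~0.9 and ~20.8 credits/CPU-hour
```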
Chris Holvenstot (Joined: 2 May 10, Posts: 220, Credit: 9,106,918, RAC: 0)
Mod.Sense - thank you for taking the time to respond and provide your insight. I guess at this point my bottom-line question goes something like this:

I currently have a task on one of my systems whose CPU time is at 04:34:07. The last checkpoint it took was at 00:09:32 - almost four and a half hours ago. Just to make sure I communicated that correctly: no checkpoint in four and a half hours.

Now it would seem that we are presented with two basic scenarios here:

1. It has encountered a particularly "challenging" set of numbers and, like Mike Mulligan's little steam shovel, it is resolutely working its way through the task at hand and producing valid data.

2. It is sitting over in the corner, spinning its wheels, drooling on itself, not accomplishing anything of value. Like maybe somehow a loop counter got decremented past zero (not that I ever got bit by using "=" instead of "<=" - cough, cough).

Now, if it is scenario #1, then I will just sit back, enjoy a bit of Glenlivet 18, and watch the show. I don't give a set of "rat's bifocals" about the credit, but I will admit to a personality flaw that causes me to twitch uncontrollably when I suspect that something isn't right in Gotham City.

However, if we are looking at scenario #2, maybe it's time to terminate the task and get back to producing valid data for the good doctor. In order to avoid slaughtering innocent bystanders, I would propose using the criterion of the last checkpoint being more than an hour old as the discriminator.

As a point of reference, I would estimate - from the gut - that I burned 15 to 20 percent of my total CPU time on these "unique" long-running tasks over the past few days.

Thanks for all the good work you do, and for cheerfully putting up with grumpy old men such as myself.
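(For anyone wanting to automate that "stale checkpoint" check, here is a rough sketch that parses the output of boinccmd --get_tasks. It assumes the client prints "checkpoint CPU time" and "current CPU time" lines, as 6.x clients appear to; adjust the field names if your client's output differs.)

```python
import subprocess

# Rough helper: flag tasks whose last checkpoint lags their CPU time by more
# than an hour. Assumes "boinccmd --get_tasks" prints "name:",
# "checkpoint CPU time:" and "current CPU time:" lines for each task.
STALE_SECONDS = 3600

out = subprocess.run(["boinccmd", "--get_tasks"],
                     capture_output=True, text=True, check=True).stdout

name = current = checkpoint = None
for raw in out.splitlines():
    line = raw.strip()
    if line.startswith("name:"):
        name = line.split(":", 1)[1].strip()
    elif line.startswith("checkpoint CPU time:"):
        checkpoint = float(line.split(":", 1)[1])
    elif line.startswith("current CPU time:"):
        current = float(line.split(":", 1)[1])
    if name and checkpoint is not None and current is not None:
        if current - checkpoint > STALE_SECONDS:
            print(f"{name}: {current - checkpoint:.0f} s of CPU time since last checkpoint")
        name = current = checkpoint = None
```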
Sid Celery (Joined: 11 Feb 08, Posts: 2130, Credit: 41,424,155, RAC: 16,102)
I currently have a task on one of my systems - at this point its CPU time is 04:34:07.

As long as the CPU time is clocking up, you have to trust it (though I perfectly understand why you wouldn't). Only if the CPU time isn't moving forward can you be sure it's gone wrong.

I also assume that the watchdog shuts the task down in a cleaner way than aborting would. Also, however many models completed successfully before the long-running one started would be made use of (as I understand things) if the watchdog kicks in, but they all get trashed if you abort. Though with a last checkpoint at 9 minutes, that wouldn't amount to much.

I'm sure someone will tell me if my assumptions here are wrong.