Message boards : Number crunching : WU run times out of whack
Author | Message |
---|---|
Nemesis Send message Joined: 12 Mar 06 Posts: 149 Credit: 21,395 RAC: 0 |
My runtime preference is set to 3600 seconds - 1 hour. Now I'm getting all these WUs that want to run waaaay in excess of that, like this one: 1gidA_BOINC_RNA_ABINITIO_RNA_CONTACT_RNA_LONG_RANGE_CONTACT_RNA_SASA-1gidA-_1634_628_1 Not only is this thing taking 2 and a half hours at least, it's also driving up my DCF which in turn limits downloads by artificially increasing run time estimates for ALL WUs. I'm going to let this one finish, but abort anything else in my queue that looks vaguely similar. Nemesis n. A righteous infliction of retribution manifested by an appropriate agent. |
Sailor Send message Joined: 19 Mar 07 Posts: 75 Credit: 89,192 RAC: 0 |
It has to run at least through 1 model to finish a work unit. When one model takes 90-120 mins to finish, then it's clear why that work unit can't finish in the given target time. Check the graphic to see if it's still running model 1; I'm pretty sure that's what's causing it. |
Nemesis Send message Joined: 12 Mar 06 Posts: 149 Credit: 21,395 RAC: 0 |
I *KNOW* that - my point is why do I get sent work that is outside my stated run-time preference, and not just by a little, but many multiples of the preference setting? The server knows that number, so it shouldn't send long WUs. Nemesis n. A righteous infliction of retribution manifested by an appropriate agent. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I see clearly your suggestion that more precise estimates could potentially be incorporated into tasks, and perhaps work could even be created dynamically, customized for each user's runtime preference. But, at this point, that isn't how it works. The tasks are all pre-created, with a consistent runtime provided for each, so that your DCF adjusts to come in line with your runtime preference (in most cases). As you point out, this approach has its flaws. And when the smallest possible runtime preference is selected, the DCF calculations may actually work against you rather than settling in, due to a comparatively high variability in your runtimes relative to your runtime preference. So, given that the system has these areas leaving room for improvement, and given that your primary concern seems to be accurate scheduling and downloading of work... why not increase your runtime preference? This will help you see completion times that are much more consistent with your higher target runtime. Your DCF will settle in and accurately maintain your cache of work at the size you like. I am assuming you probably have some additional consideration of some kind that leads you to choose the one hour. So, it's an honest question. Perhaps understanding the factors in your reply can help devise an approach you will find to your liking. Also, Rhiju has promised! that work units that run for over an hour per model will become increasingly rare in the future in his post here on Ralph where he says: ...we do *not* plan to send workunits to Rosetta@home that take more than an hour per decoy! ...but he did not clarify: an hour on what speed machine? These days, your hour might be my whole Saturday! ...but the mix of work units will bring faster models. And better checkpointing is coming as well, so even if the model is not completed, at least work won't be lost when transitioning to another project or shutting down the machine. Rosetta Moderator: Mod.Sense |
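The settling behavior Mod.Sense describes can be sketched roughly as follows. This is only an illustration: the real BOINC client uses a more involved, asymmetric update rule, and the smoothing factor and names here are made up for the example.

```python
# Illustrative sketch of how a Duration Correction Factor (DCF)
# nudges estimated run times toward observed ones. One long outlier
# task inflates the estimate applied to *every* subsequent task.

def update_dcf(dcf, estimated, actual, smoothing=0.1):
    """Move DCF toward the ratio of actual to estimated run time."""
    ratio = actual / estimated
    return dcf + smoothing * (ratio - dcf)

dcf = 1.0
base_estimate = 3600.0  # server-side estimate, in seconds

# A run of well-behaved 1-hour tasks: DCF stays at 1.0
for _ in range(10):
    dcf = update_dcf(dcf, base_estimate, 3600.0)

# One 2.5-hour outlier drags DCF, and all future estimates, upward
dcf = update_dcf(dcf, base_estimate, 9000.0)
print(round(dcf, 2), round(dcf * base_estimate))  # 1.15 4140
```

With a short runtime preference, each outlier is a large multiple of the target, so the correction swings harder, which is the "working against you" effect described above.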
Nemesis Send message Joined: 12 Mar 06 Posts: 149 Credit: 21,395 RAC: 0 |
My main consideration for wanting short runtimes in this project is the lack of checkpointing *within* a model. I don't want to have a machine go off and lose 2, 3, 4 hours of work. I also strongly believe that distributed computing in general needs to keep WUs short. The current trend of ever-increasing times (not just in this project) is counter to the idea of distributing small units of work to many machines. Also, within the BOINC framework, task switching between projects becomes easier and safer with shorter WUs. There is much less risk of a large WU held in memory becoming corrupted or lost and wasting hours of CPU cycles. The "pre-created tasks here with a consistent runtime" (your words) should give the server the option to not issue long WUs to users with short runtime preferences. If I have a 10 day queue and suddenly the WUs are much larger, the remainder of the WUs in the queue are in danger of missing deadlines. Suddenly increasing DCFs also don't play nice with large work queues. This is a BOINC problem, since all WUs from the same project get the same estimated time to complete applied to them. If my estimates jump from 1 hour to 2.7 hours as they did today, but only a few of those are long, my queue is no longer filled to 10 days, but considerably less. Nemesis n. A righteous infliction of retribution manifested by an appropriate agent. |
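The queue-shrinking effect Nemesis describes, where one project-wide estimate is applied to every task, can be shown with a small worked example. The numbers below are illustrative, not measured:

```python
# Sketch of the cache effect: BOINC applies one estimate to every
# task from a project, so an estimate inflated by a few long tasks
# shrinks the real coverage of a fixed-days work cache.

def tasks_requested(cache_days, per_task_estimate_hours):
    """How many tasks are fetched to cover the requested cache."""
    return int(cache_days * 24 / per_task_estimate_hours)

cache_days = 10
actual_hours = 1.0  # most tasks really take about an hour

# Estimates match reality: queue covers the intended ~10 days
n = tasks_requested(cache_days, 1.0)
print(n, n * actual_hours / 24)  # 240 10.0

# Estimates inflated to 2.7 h by a few long tasks: far fewer tasks
# are fetched, and since most still take 1 h, the queue covers
# only about 3.7 real days instead of 10
n = tasks_requested(cache_days, 2.7)
print(n, round(n * actual_hours / 24, 1))  # 88 3.7
```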
Sailor Send message Joined: 19 Mar 07 Posts: 75 Credit: 89,192 RAC: 0 |
I don't get the "loss of work" thing... It never saves within a model, so it makes no difference whether your machine goes down with a work unit of 1 hour or 24 hours. I'm using an 8-hour target time, and I shut down my PC quite a lot, or just shut down BOINC to game. Sure, I lose the current model, so I might drop from 7h30 CPU time to 7h00 or even 6h30, but if it were a 1-hour WU I would drop to 0h00, so there is no difference. Also, mind if I ask why you are running 1-hour WUs with a work cache of 10 days? I don't see any sense in that, and of course a low target time will be more unpredictable there. I'm running 8 hours + 1 day's cache, so when the Rosetta servers were unable to send out work, like some days ago, I set the target time to 24 hours, and I got plenty of work for the next 4 days ;) A 10-day work cache to me would only make sense if you have a very limited possibility of connecting to the net. And if that is so, there is no difference how long one WU crunches - correct me if I'm wrong. PS: If you call 1- or 3-hour WUs large, then what are the climateprediction WUs? =P |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
My main consideration for wanting short runtimes in this project is the lack of checkpointing *within* a model. I don't want to have a machine go off and lose 2, 3, 4 hours of work. This is one focal point of the Rosetta release that is presently under development. More frequent checkpoints will allow BOINC more flexibility in rotating amongst projects as well, as I believe the new BOINC version has a strong preference to only switch when a task reaches a checkpoint. This will help debt remain near zero as well, another advantage. The next release is only a few weeks away. But until then: if you set your cache size down lower for a short time, then ratchet your WU runtime preference up to even the 3-hour default (or higher), then once DCF adjusts and predicts new tasks will take about 3 hrs (or the new runtime preference), set the cache as desired... then you'll be quite happy with the accuracy of the completion times and the cache size. Setting the runtime preference higher will not risk losing any work that is not already being lost. So there's really nothing to lose in trying a longer runtime. Rosetta Moderator: Mod.Sense |
rechenknecht123 Send message Joined: 15 Oct 06 Posts: 17 Credit: 2,022 RAC: 0 |
Lost 4 WUs to an error on my Mac OS X 10.3.9 installation (PPC G4, 400 MHz). WU estimated time was ca. 4-5 h; real crunching time was 13 h of CPU time over 1 day of lifetime. In my opinion, 3 checkpoints are necessary, at 25%, 50% and 75% of the work done. So it is not so user-friendly. I also lost 2 SETI WUs when their deadlines passed. Greetings, rechenknecht My main consideration for wanting short runtimes in this project is the lack of checkpointing *within* a model. I don't want to have a machine go off and lose 2, 3, 4 hours of work. |
Nemesis Send message Joined: 12 Mar 06 Posts: 149 Credit: 21,395 RAC: 0 |
More WUs over my run-time preference today. I think it's time to abort the lot and wait for this to get fixed. Nemesis n. A righteous infliction of retribution manifested by an appropriate agent. |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
...and of course, he hides his machines, so we can't see whether his 3600-second preference is exceeded by 100 seconds or 1000 seconds. Just FYI, I've not seen anything indicating they are working on any "fix" for your problem with your 1-hour runtime preference, just that they are working on better checkpointing. But I can suggest an EASY fix! They could stop allowing a preference of less than 3 hours. It would really have a number of advantages all around: it would be a nice roundish 10,000 seconds; it would reduce bandwidth on the servers and the number of tasks to track throughout all the databases; it would make the initial WUs downloaded match their presumed initial time adjustment factor; it would make the scheduler better able to judge how many tasks to ask for to keep your queue filled to the desired level; and it would really be simple to just eliminate the 1- and 2-hour preferences from the dropdown list. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Nemesis Send message Joined: 12 Mar 06 Posts: 149 Credit: 21,395 RAC: 0 |
...and of course, he hides his machines, so we can't see if his 3600 second preference is exceeded by 100 seconds or 1000 seconds. A. More than 2000 seconds. Hiding my machines is my valid choice, and none of your business. I didn't ask for *your* help. B. You just have to read this thread (message 39213 in case you need help) C. See B. No need. The client just needs to honor the cruncher's wishes, set via preferences. Nemesis n. A righteous infliction of retribution manifested by an appropriate agent. |
Sailor Send message Joined: 19 Mar 07 Posts: 75 Credit: 89,192 RAC: 0 |
I don't know what your point is in posting here. People are giving you input & opinions, and you go aggressive on them. Seems your only reason to post is to whine a bit. Bet you're running a 200 MHz PII and wondering why 1 model is exceeding 1 hour of CPU time. rofl. -1 for these posts; now stop whining or make some constructive posts. I totally agree with Feet1st: nobody really needs the 1- and 2-hour target times, just remove these options. |
Nemesis Send message Joined: 12 Mar 06 Posts: 149 Credit: 21,395 RAC: 0 |
I don't know what your point is in posting here. People are giving you input & opinions, and you go aggressive on them. Seems your only reason to post is to whine a bit. Bet you're running a 200 MHz PII and wondering why 1 model is exceeding 1 hour of CPU time. rofl. I'm posting to let the project know they still have not fixed the problem of exceeding the run-time preferences. I'm not asking for input from people who don't have "Developer" or "Programmer" or "Scientist" next to their name. I've been in distributed computing far longer than most of them. I don't run projects that have long work units. Period. FWIW (and I don't know why it's any of your business), the PCs that I crunch with are all high-end AMDs or server-class multi-CPU Intel Xeons. Edit: Sailor, you're now the first in my plonk file. Nemesis n. A righteous infliction of retribution manifested by an appropriate agent. |
Conan Send message Joined: 11 Oct 05 Posts: 151 Credit: 4,244,078 RAC: 2,057 |
I too have a problem with preferences not being honoured. I have a preference of 6 hours, and on one of my Linux machines the BOINC Manager keeps saying the work unit is going to take 7 to 8 hours, although it still completes close to the 6-hour mark, so it's not the same problem as Nemesis's. The BOINC Manager does not lower the next run time estimate; each new work unit still shows well over 7 hours. It is only affecting 1 computer; my others all show the correct 6-hour preference. |
Nemesis Send message Joined: 12 Mar 06 Posts: 149 Credit: 21,395 RAC: 0 |
The project client needs to honor the Run Time Preference as vigorously as any other BOINC or project preference, such as memory or disk usage, keep in memory, work while busy, or any of the others. I provided a detailed methodology on how to accomplish this in a previous thread, but it's obviously been ignored and has not been implemented. So - they'll just have to live with a few hundred WUs aborted every time I see them sending my DCF through the roof and the estimated run times exceeding 1:00:00. Nemesis n. A righteous infliction of retribution manifested by an appropriate agent. |
MattDavis Send message Joined: 22 Sep 05 Posts: 206 Credit: 1,377,748 RAC: 0 |
So - they'll just have to live with a few hundred WUs aborted every time I see them sending my DCF through the roof and the estimated run times exceeding 1:00:00. I hope they ban you from the project. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Nemesis, I agree. We need to make sure experiments do not have run times per model greater than the minimum preference. We're working on reducing the run times for some of the recent experiments with long run times. We are also adding more checkpointing. These improvements should be in the next app release. There may be some experiments in the future that need long run times per model, and if they are run on a regular basis (which none have been so far to my knowledge, just a batch here and there) we might have to consider increasing the minimum run time pref. We'll alert users if we plan on doing this. A 3-hour minimum seems reasonable. Another option would be to modify the submission procedure/system and the scheduler to take run time/model estimates (either manually from the scientist or from Ralph) and then make sure the scheduler sends WUs to clients with the appropriate run time preference. But this wouldn't be perfect due to the variability in run times even within the same work unit, and it's more work, so I'd lean towards a longer minimum pref. It's okay to abort work units, since they will either be picked up by another host or more will be created if necessary, but aborting is a hassle that volunteer crunchers shouldn't have to deal with, and the correct thing for us to do would be to make sure our app honors the run time pref as you have stated. |
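The scheduler-side option outlined above could look roughly like this. This is a hypothetical sketch, not the real Rosetta/BOINC scheduler code; the batch names and per-model estimates are invented for illustration:

```python
# Hypothetical sketch of the scheduler change described: given a
# per-model run time estimate for each batch (from the scientist or
# from Ralph test runs), only send a task to a host whose run time
# preference can fit at least one model.

def eligible_batches(batches, host_pref_hours, slack=1.25):
    """Return batch names whose per-model estimate fits the host.

    `slack` allows for the run time variability mentioned above:
    a model may run somewhat longer than its estimate.
    """
    return [name for name, per_model_hours in batches
            if per_model_hours * slack <= host_pref_hours]

batches = [
    ("abinitio_small", 0.4),   # invented batch names and estimates
    ("rna_long_range", 2.5),
    ("docking_medium", 0.9),
]

print(eligible_batches(batches, host_pref_hours=1.0))
# ['abinitio_small']
print(eligible_batches(batches, host_pref_hours=3.0))
# ['abinitio_small', 'docking_medium']
```

As the post notes, the per-model variability means this filter can never be perfect, which is why a longer minimum preference is the simpler fix.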
ramostol Send message Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
This debate was moving in the wrong direction (cf. rules points 4 and 5), and I am quite relieved to observe that all the moderators approach the question of runtime preferences in a sensible way. I have also observed this problem, but knowing that more checkpointing is expected in the next version of Rosetta, I decided that this was no big issue at present. In my case a short runtime preference is selected because of insufficient checkpointing. With checkpointing every 15 minutes I should probably increase my runtime preference from 2 to 6 or 8 hours since my work loss when shutting down the computer then will be negligible. It follows that I look forward to seeing what frequency is selected - it certainly will influence the way I handle the project. -- R. A. Mostol |
Sailor Send message Joined: 19 Mar 07 Posts: 75 Credit: 89,192 RAC: 0 |
In my case a short runtime preference is selected because of insufficient checkpointing. With checkpointing every 15 minutes I should probably increase my runtime preference from 2 to 6 or 8 hours since my work loss when shutting down the computer then will be negligible. You are wrong on this point; I'll try to explain. Rosetta saves the progress after each model finishes. I'm running 14 hours atm, don't ask why, just for fun. Yesterday I went to play a F.E.A.R. match, so I closed the application at a runtime of 8:34:XX. When I started again, it resumed at 8:32:XX. Why? I had just passed the moment where a model finished. This is the SAME situation for every runtime; it makes no difference... If I had closed BEFORE the model finished at 8:32, I might have dropped to 7:55:00 or something, but a 1-hour WU would have been at 0:37:00 there and would have dropped to 0:00:00 - can you follow me? You WON'T lose 6 hours of work when closing BOINC with an 8-hour work unit close to the end; it will resume from the last finished model. On my machines usually around 30 mins are needed per model, of course depending on the WU. |
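The point above can be put in one line: with checkpoints only at model boundaries, the work lost on a shutdown is bounded by the length of one model, independent of the target run time. A toy sketch, with made-up model lengths:

```python
# Toy illustration of per-model checkpointing: progress is saved only
# when a model finishes, so an unplanned shutdown loses at most one
# model's worth of CPU time - the same bound whether the target run
# time is 1 hour or 14 hours.

def cpu_time_after_restart(shutdown_at_hours, model_hours=0.5):
    """CPU time recovered from the last completed-model checkpoint."""
    completed_models = int(shutdown_at_hours / model_hours)
    return completed_models * model_hours

# 8-hour task killed at 7.8 h: resumes from 7.5 h (0.3 h lost)
print(cpu_time_after_restart(7.8))  # 7.5
# 1-hour task killed at 0.8 h: resumes from 0.5 h (0.3 h lost)
print(cpu_time_after_restart(0.8))  # 0.5
```

Shorter target times don't reduce the loss per shutdown; they only mean the lost model is a larger fraction of the whole task.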
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
To try and simplify what Sailor is saying... preferred runtime does not affect when checkpointing occurs. We all get just as many checkpoints, regardless of the runtime preference. If a task runs for 8 hours, the machine is rebooted and then it restarts, you will see its completion % reset to zero, but its time spent still increasing from the checkpoint prior to the point when the machine was powered off. This is a flaw in the currently reported completion %, not an indication that 8 hrs of work was lost. I'm sure Rhiju has been looking into that issue as well for the work on the next release. If you note, about half of the "Problems with 5.59" thread is people confused by and reporting this quirk in the new changes for the more precise % complete. Rosetta Moderator: Mod.Sense |
©2024 University of Washington
https://www.bakerlab.org