Message boards : Number crunching : Shorter WU deadlines
Author | Message |
---|---|
Nightbird Send message Joined: 17 Sep 05 Posts: 70 Credit: 32,418 RAC: 0 |
@ David Baker: "another change is that the maximum work unit length has been increased to eliminate (hopefully!) the time out problem" Ehm, can you explain? I'm not sure that I understand here. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
When work is created it has a number which essentially says how many OPS it will take to complete. From this the max time is calculated. If the number is set too low, there is not enough time to do all the operations, so the Result times out. ==== edit Results that would never complete are what this limit is meant to address. The problem is that if the time to complete / number of operations to complete is highly variable, choosing the "right" value is difficult. |
Nightbird Send message Joined: 17 Sep 05 Posts: 70 Credit: 32,418 RAC: 0 |
"When work is created it has a number which essentially says how many OPS it will take to complete. From this the max time is calculated. If the number was too low, there is not enough time to do all the operations so the Result times out." Thanks for your answer :) But OPS? Operations, I guess - and a new acronym for your "BOINC Acronyms" ;) And how is the max time calculated? |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"When work is created it has a number which essentially says how many OPS it will take to complete. From this the max time is calculated. If the number was too low, there is not enough time to do all the operations so the Result times out." "Thanks for your answer :) But OPS? Operations, I guess - and a new acronym for your "BOINC Acronyms" ;)" Ops is a fairly standard abbreviation within computing. Ambiguously, it can stand for 'operations' or 'operations per second' - you have to figure out which from the context. The same goes for Flops, which depending on context can mean floating point operations, or floating point operations per second. "And how is the max time calculated?" The project specifies a max number of ops for each workunit (or is it each type of WU?), and the client uses the benchmark to turn this into a max CPU time. |
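To make that last step concrete, here is a minimal sketch of the idea (an illustration only, not the BOINC client's actual code; the name rsc_fpops_bound for the per-workunit operation limit is borrowed from BOINC's workunit settings, and the benchmark figure is assumed to be the host's measured floating-point speed):

```python
# Minimal sketch (not the real BOINC client code) of turning a per-workunit
# operation bound into a maximum CPU time using the host benchmark.

def max_cpu_time_seconds(rsc_fpops_bound: float, benchmark_flops: float) -> float:
    """CPU-time limit: total operations allowed divided by the host's
    benchmarked floating-point operations per second."""
    return rsc_fpops_bound / benchmark_flops

# Example: a bound of 8.64e13 operations on a host benchmarking at 1 GFLOPS
# gives a limit of 86,400 CPU seconds, i.e. one full day of CPU time.
print(max_cpu_time_seconds(8.64e13, 1e9))   # 86400.0
```

If the bound is set too low for a slow host, that limit is reached before the work finishes - which is the time-out described above.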
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
In theory you can specify IntOPS and FLOPS separately, but few projects do a good study of this to set the proportions. Almost all concentrate on FLOPS only. So I was being generic: Operations per Second, encompassing both floating point operations and integer operations. If you dig deep you can also find that there are other classes of number systems used, and each of these will also have characteristic speeds. These systems include:
BCD - binary coded decimal, used for exact decimal math
Boolean - though usually lumped with integer, can be slightly different in speed, usually not enough to matter (which is why it is lumped with integer)
Fixed Point - which is another number encoding scheme.
I know that there are a couple of others that are even less common, but I can't remember them off the top of my head.
And FLOPS is actually sometimes different if you talk single precision vs. double precision. With IEEE 754 compliant systems this is usually not an issue, as the FPUs (usually) use 80-bit internal representations, so the only difference is in the final output result. This is another reason that optimizing code can change the output values. If I stay in 80-bit precision over more operations, my numerical error propagation is reduced because I have more digits of accuracy. If I am pulling the numbers out of the FPU, converting them back to single precision and then back again, the values will TEND to have slightly more error. Not an issue in most cases. However, in iterative systems minor changes in the accumulation of error can give amazing differences in final outputs because of these seemingly trivial changes.
There are some good references in the Wiki ... look up floating point numbers. I try to summarize how a lot of this "works", though I am sure that the simplification makes it rather incorrect in detail. The difficulty lies in the fact that floating point is a scheme that encodes numbers. It can only precisely represent certain values; all other values cannot be represented (shown in the Wiki example). More interesting is the fact that the intervals between numbers that are representable are not constant: where you are on the number line drives the "distance". Factors like these easily catch people off guard, including many of us who should know better. What is more distressing is the fact that many scientists do not really understand the fundamental issues and how they can affect the system they are coding. Part of these issues are what drove the debate between myself and Jack S. with regard to the random number generator used.
I worked with a mathematician once on a system that derived a polynomial representing the curves described by a matrix of numbers. Over the decade I worked with him, never really understanding what the heck we were doing, I developed an acute sensitivity to these questions as they apply to iterative systems. With a random number generator you have two concerns: a) period, and b) distribution ... An example of period is the Apple II, where the RND function was believed to have a period of 1G or so ... it turns out to have been roughly 17,000 before it began to repeat itself. One of the reasons many games on the early Apples were not that exciting ... :)
anyway, more than you wanted to know I am sure ... |
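To make the error-accumulation point concrete, here is a small self-contained illustration (my own sketch, nothing to do with Rosetta's code): the same sum done while keeping the running total in double precision, and again with a forced round to single precision after every step.

```python
# Standalone illustration of error accumulation: summing the same values
# in double precision vs. rounding the running total to IEEE 754 single
# precision after every operation.
import struct

def to_single(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def sum_double(values):
    total = 0.0
    for v in values:
        total += v                                # stays in double precision
    return total

def sum_single(values):
    total = 0.0
    for v in values:
        total = to_single(total + to_single(v))   # rounded to single each step
    return total

values = [0.1] * 100_000          # 0.1 is not exactly representable in binary
print(sum_double(values))         # very close to 10000.0
print(sum_single(values))         # drifts noticeably away from 10000.0
```

In a single pass the difference is small; in an iterative calculation that feeds each result back into the next step, that drift is exactly the kind of thing that produces "amazing differences in final outputs".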
Nightbird Send message Joined: 17 Sep 05 Posts: 70 Credit: 32,418 RAC: 0 |
"anyway, more than you wanted to know I am sure ..." Indeed, but I don't have the feeling that I understand you. ;) |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
"When work is created it has a number which essentially says how many OPS it will take to complete. From this the max time is calculated. If the number was too low, there is not enough time to do all the operations so the Result times out." The max time is set by the project. It is a parameter that is part of the WU sent to your computer. It is only a maximum number, and it is expressed as a number of operations. It really does not matter too much in human terms whether it is CPU cycles or floating point calculations. The system keeps count of whichever it is, and when it hits the number set by the project, it aborts the WU with a Max time error. The project sets the value by giving their "best estimate" of the time a WU should run, and then converting that value to operations per second, or flops, or whatever you want to call them. As a result of having a lot of Max time errors, the project recently raised the Max time setting (bounds) for many WU types, and the Max time errors have almost stopped. I have only seen one in the last week. Before they made the adjustment I was seeing 5 or 6 a day. Regards Phil We must look for intelligent life on other planets, as it is becoming increasingly apparent we will not find any on our own. |
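A toy sketch of that counting-and-aborting behaviour (a simplification assumed purely for illustration; the real client's bookkeeping and error reporting differ in detail):

```python
# Toy model of the "Max time" abort: run a task step by step and give up
# once the accumulated CPU time passes the project-supplied limit.
import time

def run_with_limit(step_fn, max_cpu_seconds: float) -> bool:
    """Call step_fn() until it returns True (done); return False on a max-time abort."""
    start = time.process_time()
    while not step_fn():
        if time.process_time() - start > max_cpu_seconds:
            print("Maximum CPU time exceeded - result aborted")
            return False
    return True

# Example: a fake task that needs 50 quick steps, well inside the limit.
steps = iter(range(50))
done = run_with_limit(lambda: next(steps, None) == 49, max_cpu_seconds=5.0)
print("finished" if done else "timed out")
```

Raising the bound, as the project did, simply moves the limit further out so that legitimately long runs no longer trip it.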
mikus Send message Joined: 7 Nov 05 Posts: 58 Credit: 700,115 RAC: 0 |
"In normal running (when projects are run in Round Robin mode) the cache for each project does indeed run first-in-first-out." Two comments: (1) Is this information (and the earlier explanation that EDF mode is entered if the deadline for a WU is less than twice the cache size) __available__ to people who want to join the project? I came to Rosetta from DF; if I had known that Rosetta would start playing "priority games" I would probably not have joined. (2) I myself RESENT that now, with 7-day WUs, I would have to set my "cache size" (actually, the interval between connects) to THREE days to avoid EDF mode. I run offline; I have very little liking for a project that now FORCES me to connect every three days. WHY force __participating volunteers__ to "jump" as soon as the project comes up with a "bright idea"? To me, a more reasonable way to run a railroad would be to keep longer-deadline WUs available, and to set up the download process such that for users specifying an interval between connects of 7 days or more, the WUs preferentially downloaded would be ones that do __not__ push the client into EDF mode. mikus -------- p.s. I intend to run only one BOINC project per computer. My computers are normally off-line. The computer on which I run Rosetta has a specified 'time to connect' of eight days. With the new 7-day WUs, that computer is put into EDF mode as soon as the first WU starts downloading. But the only project *is* Rosetta, and *all* Rosetta WUs currently have seven-day deadlines. On my computer EDF mode accomplishes NOTHING (except aggravation). |
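For what it's worth, the rule quoted above can be written down in one line (my simplified reading of it, not the scheduler's actual test), and it reproduces the numbers in this post:

```python
# Simplified reading of the quoted rule (an assumption for illustration):
# a result is treated as deadline-critical when its deadline is less than
# twice the connect interval / cache size.
def enters_edf(deadline_days: float, cache_days: float) -> bool:
    return deadline_days < 2 * cache_days

print(enters_edf(7.0, 8.0))   # True  - an 8-day connect interval with 7-day WUs
print(enters_edf(7.0, 3.0))   # False - dropping to a 3-day interval avoids EDF
```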
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
"Two comments:" The information is available from a range of sources. The easiest one to point out is the WIKI; there is a link to the WIKI from the project home page. I have no doubt that the situation you describe is occurring on your systems, but it is caused by the way in which you have chosen to operate your machines. BOINC was never designed as a single-project system. The whole point of BOINC is to share otherwise unused cycles between a number of projects. While it is possible to run BOINC with intermittent network connections, the system works best with a permanent connection. The choices made by the project as to reporting deadlines are being made because there is a reason for them. This has been discussed in the science forum. With a project such as this, the project team has to run the project by balancing what is required by the science against how the majority of the user community is configured. This by definition means that there will be systems and configurations at either end of the spectrum that will have significant inconvenience, or be unable to run the project. The vast majority of the users of R@H have no difficulty at all. They set the system up, adjust their preferences, and the thing just runs. On occasion a WU will fail, or some other transient problem occurs, but this is just the nature of the beast and they take it in stride. It seems as though for your setup a project like Einstein or Predictor may be more appropriate. But considering that there are somewhere around 36,000 users, and most of them are not having significant difficulty, it is unlikely that a lot of changes would be made to accommodate 50 or even a few hundred users who wish to operate their systems near the outer edges of the normal BOINC environment. Your problem could be solved by just setting the connection interval to 6 days, letting the system connect when it needs to connect, and letting the system run itself. If you have only the one modem, then perhaps a $50 LinkSys switch would allow you to network your systems so they could share the connection. The fact that the system enters EDF mode when it is only running one project has no impact whatsoever in practical terms. If none of this will work for you, then I think you have decided that R@H won't work for you. Regards Phil We must look for intelligent life on other planets, as it is becoming increasingly apparent we will not find any on our own. |
ecafkid Send message Joined: 5 Oct 05 Posts: 40 Credit: 15,177,319 RAC: 0 |
Not to burst your bubble but I just downloaded some WU's with 30 day deadlines. So not all WU's have the 7 day deadline. And yes they are some of the earlier WU's but because of the 30 day turnaround it took them that long to cycle back through. |
nasher Send message Joined: 5 Nov 05 Posts: 98 Credit: 618,288 RAC: 0 |
Well, being that I am currently deployed and can't check on my computers beyond looking at the results sent by all projects, they seem to me to be running fine. As for the short 7-day work units: in another post they told us that there were some jobs they needed results on quicker than normal, and that they would be going back to the longer times later. I understand that from time to time there is a reason that a project may need to get a certain job or set of WU's done quickly. For instance, when Pirates put out its last set of jobs they ran about 5-50 min and had a 6-hour deadline, since they needed answers right away - of course, people running Pirates knew this. I think a 7-day turnaround for some WU's is sometimes reasonable, but I hope it doesn't become the NORM. If it happens now and again I have no problem with it, since it will crunch my jobs as I desire without me adjusting schedules. Oh, another thing: if you have it set to get 8 days of work, of course it's going to grab a lot of Rosetta or whatever you ask it to connect to. Then, since the Rosetta WU's you grabbed were of the 7-day variety, yes, it went into EDF timing. Personally I have myself set up for 0.1-0.5 days depending on the machine and what it's used for. This tends to keep at least 2-3 jobs of one project or another queued and doesn't put me into EDF that often. |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
EDF (earliest deadline first) isn't an error. It's just an explanation of what the scheduler is doing. It still keeps track of all project resource shares by the amount of time spent, and balances the work done long term by this number. EDF just means you'll do a bunch of one, then switch to another, then switch again. All the time it will balance out the resource share you've requested. In EDF it just doesn't happen "intraday", but can take days/weeks to balance. There is nothing wrong with running in EDF mode. Amongst us BOINC alpha testers and developers we are currently discussing name changes to get people to understand that "overcommitted" and "EDF" aren't bad. Thus far we haven't decided on the proper terminology, or methodology, to achieve this. |
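A rough sketch of the two behaviours (an idealised model, not BOINC's real scheduler code): in normal mode the client leans toward whichever project is furthest behind its resource share, while in EDF mode it simply picks the queued task with the earliest deadline and lets the share accounting catch up later.

```python
# Idealised model of the scheduling decision described above.
from dataclasses import dataclass

@dataclass
class Task:
    project: str
    deadline: float          # seconds until the report deadline

def pick_next(tasks, share_debt, deadline_pressure: bool):
    """share_debt maps project -> how far behind its resource share it is."""
    if deadline_pressure:                                    # EDF mode
        return min(tasks, key=lambda t: t.deadline)
    return max(tasks, key=lambda t: share_debt[t.project])   # normal balancing

tasks = [Task("rosetta", 3 * 86400), Task("einstein", 14 * 86400)]
debt = {"rosetta": -120.0, "einstein": 300.0}
print(pick_next(tasks, debt, deadline_pressure=True).project)   # rosetta
print(pick_next(tasks, debt, deadline_pressure=False).project)  # einstein
```

Either way the long-run split of time still converges on the requested shares - EDF only changes the order, not the totals.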
Los Alcoholicos~La Muis Send message Joined: 4 Nov 05 Posts: 34 Credit: 1,041,724 RAC: 0 |
EDF may be a feature, but it doesn't work properly with the variations in wu sizes of R@h. This computer (only R@h, 24/7) got 18 PRODUCTION_ABINITIO wu's on 01-18 with a deadline of 02-15 (these wu's take over 10 hours to finish). Then R@h started to send out these short-deadline wu's. Thanks to EDF, this computer did the earliest deadlines first (about 150 wu's) and only started yesterday on those big wu's. It has finished 2 so far, so it will probably finish another 3 in time. BOINC will then let it work on wu's whose deadline has already passed. And since this computer is located elsewhere, I have to travel 60 miles to abort those wu's to prevent this computer from wasting about 150 hours of CPU time. |
accessys Send message Joined: 3 Sep 24 Posts: 1 Credit: 317,866 RAC: 1,712 |
Your deadlines are so short that my computer often gets about 90% of the work done and then times out. A lot of wasted computer usage time. I am running more than one project and none of the others time out; they mostly have smaller WUs and thus get finished. But I use the computer a lot, and if it goes over 75% of my usage it suspends BOINC. I am on the verge of withdrawing from Rosetta, even though I think it is a very important and useful project, because I am wasting computer resources. I just checked statistics and time expired on 4 WUs that were over 75% completed - what a waste. Bob |
Jean-David Beyer Send message Joined: 2 Nov 05 Posts: 188 Credit: 6,454,639 RAC: 5,933 |
"your deadlines are so short that my computer often gets about 90% of the work done then times out" Not my experience. Mine are usually set to take 8 hours, and they complete on time. My main machine:
CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors: 16
Coprocessors: ---
Operating System: Linux Red Hat Enterprise Linux 8.10 (Ootpa) [4.18.0-553.22.1.el8_10.x86_64|libc 2.28]
BOINC version: 7.20.2
Memory: 128085.97 MB
Cache: 16896 KB
Swap space: 15992 MB
Total disk space: 488.04 GB
Free disk space: 479.37 GB
Here are two recent ones:
Task 1584293409 (WU 1409564820): sent 5 Oct 2024, 20:24:18 UTC; reported 7 Oct 2024, 9:35:43 UTC; Completed and validated; run time 28,281.62 s; CPU time 27,959.69 s; credit 546.71; Rosetta v4.20 x86_64-pc-linux-gnu
Task 1584293410 (WU 1409564822): sent 5 Oct 2024, 20:24:18 UTC; reported 7 Oct 2024, 6:04:54 UTC; Completed and validated; run time 28,544.62 s; CPU time 28,078.41 s; credit 428.56; Rosetta v4.20 x86_64-pc-linux-gnu |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
"your deadlines are so short that my computer often gets about 90% of the work done then times out." There are two ways to look at this, Bob. Either the deadlines are too short, or the cache size you chose to call down across all your projects is so large that it makes you incapable of meeting all their deadlines from the moment you ask for so many. You say your other projects' tasks have comparatively short run times, so can you explain why you feel the need to call down (I'm estimating) three entire days of tasks across all your 16 cores as an offline backup? With short runtimes, there's no need to ask for more than a total of one day's worth of tasks across all your projects in the two Boinc settings that control your offline cache. In fact, some here recommend no more than 0.25 of a day. In Boinc Computing Preferences, on the Computing tab, my settings are "Store at least 0.5 days and up to an additional 0.1 days of work", and there are never any problems. If you were to reduce your settings to somewhere much nearer that (a total of 1.0 days ought to be fine too), then even if you suspend Boinc when usage is over 75%, you ought to be able to solve the problems you're having and complete, and get credit for, all your tasks. You have to appreciate that every project is different and sets deadlines according to its own specific needs. None of them will <ever> compromise their own project by allowing users to determine their needs. That would be ludicrous. That's why you need to adapt - because you're not controlling their projects, they are. So look at what all your preferred projects say they need and adjust your settings to whatever allows them all to be successful, rather than causing them to fail in a way that suits neither them nor you. Then you can successfully run every project you feel is important without the regret you expressed above. Hope that helps |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1684 Credit: 17,937,065 RAC: 22,782 |
"I just checked statistics and time expired on 4 WUs that were over 75% completed - what a waste" And I just looked at your Tasks, and there is no sign of that happening. Every Task you started you returned and got Credit for, except for one which was a Validation error. The only problem is your excessive cache. If you were running multiple projects (there is no sign of that occurring on the account you are using here at Rosetta), then there is no need for a cache at all - if one project is out of work, the other projects will keep your system busy till the one without work has work again. No need for any sort of cache at all. The only reason for having a cache is if you are doing work for just one project and that project has lots of problems with its servers sending & receiving work. Or for its original function - when people were using dial-up and had to pay every time they connected to the net. So given that most systems now have almost constant connections to the net, there is no need for a cache even with just one reliable project. Set it so there is enough work that the system doesn't have to wait to get more work once one Task has finished. No need for more than that. If the project is flaky, then set a cache large enough to keep your system busy, but within the Deadlines for the work you are processing. Here at Rosetta that is 3 days, so any cache should be less than that. eg-
Store at least 2.85 days of work
Store up to an additional 0.01 days of work
If you really think you have to have a cache. But really, people are better off with no cache at all, especially if they are doing more than one project. eg-
Store at least 0.05 days of work
Store up to an additional 0.01 days of work
Grant Darwin NT |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
"I just checked statistics and time expired on 4 WUs that were over 75% completed - what a waste" "If the project is flaky, then set a cache large enough to keep your system busy, but within the Deadlines for the work you are processing." If I'm not mistaken, the deadline includes the runtime of Rosetta's tasks, and at the very least it needs an allowance for how long a task waits before it starts running after being downloaded. So the absolute maximum needs to be 3 days minus 8hrs, minus an estimate of how long it will be before the task begins running. That's certainly less than 2.66, and probably 2.5 just to allow for some leeway. But note that Bob also says, perfectly legitimately if that's his preference, that he suspends Boinc if non-Boinc work utilises more than 75% CPU activity, so I'd be pretty uncomfortable suggesting a maximum cache size any higher than 2.0. And if way more than half his tasks are failing to even start before passing deadline, and he says 75% aren't completing before deadline, that takes me back down to the 1.0 I suggested before. Whatever figure is used, we agree that it's the cache size that's the issue - and adjusting it appropriately is the way around the fact that Rosetta's deadlines are and will remain 3 days. As a side note, we were told here that the reason for the comparatively short deadlines is that Rosetta is a live project where the results of tasks matter. Results from a batch of work are looked at and often affect the parameters of the next batch of work issued. For a large number of other projects (who will remain nameless) it doesn't matter if deadlines are a week, a month or a year - it's not going to make a difference to anything. Cache sizes should always be tailored to the project with the shortest deadlines, not necessarily the 'most important' project, but in this case Rosetta is both. |
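The arithmetic above, written out (the 8-hour runtime and the start-up delay are just the figures assumed in this discussion, not fixed project values):

```python
# Largest cache (in days) that still leaves room for a task's runtime and
# for the wait before it starts, within a fixed deadline.
def max_safe_cache_days(deadline_days: float, runtime_hours: float,
                        start_delay_hours: float) -> float:
    return deadline_days - (runtime_hours + start_delay_hours) / 24.0

print(round(max_safe_cache_days(3.0, 8.0, 0.0), 2))   # 2.67 - runtime only
print(round(max_safe_cache_days(3.0, 8.0, 4.0), 2))   # 2.5  - plus ~4 h of waiting
```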
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1684 Credit: 17,937,065 RAC: 22,782 |
"But note that Bob also says, perfectly legitimately if that's his preference, that he suspends Boinc if non-Boinc work utilises more than 75% CPU activity, so I'd be pretty uncomfortable suggesting a maximum cache size any more than 2.0." While having a higher cache value will cause problems initially when first running BOINC, as long as it has been running for a while it takes into account how much time BOINC is running, and - when it is running - how much time it is allowed to process work, in determining cache size and whether or not long-running Tasks will be done in time (though that depends greatly on how accurate the initial Estimated times are compared to the actual Runtime, which with Rosetta can vary hugely). So less than 2 days would be prudent. If you look at the details for any of your computers (only you can see these details for your systems - no one else can see them), there's the "Fraction of time BOINC is running" and then there's "While BOINC is running, fraction of time computing is allowed", which are used to determine just how much time is actually available for processing BOINC work. The lower each of those values is, the less time there is available to BOINC to process work. Grant Darwin NT |
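A rough way to see the effect of those two fractions (a simplification, not the client's actual work-fetch calculation):

```python
# How the two availability fractions stretch out a nominal cache:
# wall-clock days needed = cache days / (time BOINC runs * time computing allowed).
def wall_clock_days_to_clear(cache_days: float, on_fraction: float,
                             active_fraction: float) -> float:
    return cache_days / (on_fraction * active_fraction)

# e.g. a 2-day cache on a host running BOINC 80% of the time, and allowed to
# compute during 75% of that, takes about 3.3 wall-clock days to work through
# - already beyond a 3-day deadline.
print(round(wall_clock_days_to_clear(2.0, 0.8, 0.75), 2))   # 3.33
```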
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
"But note that Bob also says, perfectly legitimately if that's his preference, that he suspends Boinc if non-Boinc work utilises more than 75% CPU activity, so I'd be pretty uncomfortable suggesting a maximum cache size any more than 2.0." "While having a higher cache value will cause problems initially when first running BOINC, as long as it's been running for a while, it takes into account how much time BOINC is running, and when it is running how much time it is able to process work, for determining cache size - and whether or not long running Tasks will be done in time (but that depends greatly on how accurate the initial Estimated times are compared to the actual Runtime, which with Rosetta can vary hugely), so less than 2 days would be prudent." This is a fair point, which I'd forgotten tbh... Now I look closer, I notice Bob only joined on 3 Sept 2024, and the reason so many tasks came down only to miss deadline may actually <not> be due to cache size but the way Boinc goes a bit crazy on initial task downloads when it hasn't yet established a pattern of usage. Which may mean we've both unfairly maligned him here. Apologies if that is the case (oops) |