Message boards : Number crunching : Job Queue
Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Is there a simple way to have BOINC "forget" whatever parameters it uses in determining how many tasks to keep in the queue to satisfy the number of days of work to keep on hand? I ask this in the hope that by starting over it might be a little more consistent. For example, I have a few systems which, although my setting is 3 days, only keep about 4 or 5 tasks in the queue. But I also have a few systems which keep about 80 tasks in the queue. All are quad or hex core Phenom II processors, except for the Xeon in my Mac Pro. |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,178,626 RAC: 3,201 |
Chris Holvenstot said: Is there a simple way to have BOINC "forget" whatever parameters it uses in determining how many tasks to keep in the queue to satisfy the number of days of work to keep on hand?

I do the adjustments in the BOINC Manager, which affects only that PC, so I can fine-tune how many tasks I get. But you should also look in the BOINC Manager on that PC: go to the Projects tab, highlight Rosie, and click on Properties on the left. All the way at the bottom is a "Duration correction factor" and a number - what is the number? The higher it is, the longer BOINC thinks a task will take, and therefore the fewer tasks that machine will get. Each machine does its own thing as to how many units to get, with BOINC doing all the calculations on its own.
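To put rough numbers on it, here is a little back-of-the-envelope sketch of the relationship - this is NOT the real BOINC client code, just the idea, and the numbers are made up:

```python
# Rough sketch of how the Duration Correction Factor (DCF) feeds into
# work fetch. Illustrative only -- not the actual BOINC client logic.

def tasks_to_fetch(buffer_days, cores, raw_estimate_secs, dcf):
    """Approximate number of tasks needed to fill the work buffer."""
    # The client scales the server's raw runtime estimate by the DCF.
    corrected_estimate = raw_estimate_secs * dcf
    # CPU-seconds needed to keep every core busy for the whole buffer.
    buffer_secs = buffer_days * 24 * 3600 * cores
    return buffer_secs / corrected_estimate

# A 4-hour task estimate on a quad core with a 3-day buffer:
print(tasks_to_fetch(3.0, 4, 14400, 1.0))   # ~72 tasks at DCF 1.0
print(tasks_to_fetch(3.0, 4, 14400, 10.0))  # ~7 tasks if the DCF is badly inflated
```
 |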
Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
OK - as much as it pains me to admit you are right, I'll concede this one, Biscuit Boy. There is definitely the correlation you described between the "correction factor" and the number of jobs in the work queue. However, I thought that these numbers were smoothed and adjusted on a regular basis - with the exception of the occasional reboot for patches, these systems have been running 24/7 for the past few months with no changes in hardware. How do you get these numbers to more closely match the capacity of the hardware? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
It is one of those things where constant adjustment is not always ideal. If a few tasks in a row fail for some reason, the DCF can get thrown off into thinking tasks finish in less time. If a few tasks hit long-running models, the DCF can get thrown off into thinking tasks take a long time to run (and therefore that it doesn't take as many of them to keep the machine busy for an X-hour buffer).

Correct me if I'm wrong, but I've always assumed that I can observe the side effect of the DCF being off by looking at the estimated time to completion of a task that has not been started yet. If that is not in line with my runtime preference, then I know work fetch will be off, and that the DCF is not reflecting the machine's ability to complete the work. It will adjust itself again over time, but at the moment, it's off.
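To make that drift concrete, here is an illustrative sketch - the exact update rules vary by client version and this is not the real code, only the flavor of "raise quickly, lower slowly":

```python
# Illustrative only -- NOT the actual BOINC client logic; the real
# DCF update rules vary by client version.

def update_dcf(dcf, estimated_secs, actual_secs):
    ratio = actual_secs / estimated_secs
    if ratio > dcf:
        # Task ran much longer than estimated: raise the DCF quickly.
        return ratio
    # Task finished (or died) early: lower the DCF slowly.
    return dcf + 0.1 * (ratio - dcf)

dcf = 1.0
# Five tasks in a row that error out after a minute instead of
# running their estimated 4 hours drag the DCF well below 1.0:
for _ in range(5):
    dcf = update_dcf(dcf, 14400, 60)
print(round(dcf, 2))  # ~0.59 -- the client now thinks tasks are short
                      # and fetches far more work than it should
```

Rosetta Moderator: Mod.Sense |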
Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Mod.Sense - thanks for the response - you are partially right, but I don't think we have hit the mark yet. Let's examine "popeye" - my desktop system, which also crunches numbers when I type real slow. CPU ID 127776.

Jobs sitting in the "Ready to start" state show a projected run time of about 7:33. My preferences are set to a 4-hour execution time, and I have requested three days' worth of jobs to be kept in the local queue.

If you look at the output of a typical task - 37692446 - you will see my run_time_preference set to 14400 - or four hours if you prefer. The output of the task shows it completing in 14276 seconds - right on schedule.

This is a 3.4 GHz quad core system. It typically has right around 4 or 5 jobs in the "Ready to start" state. For Rosetta it has a DCF of 1.8885.

Logically speaking, on this system - a quad core with a four-hour run time and a 3-day work queue - I should have about 72 jobs in the queue, give or take (4 cores x 6 4-hour tasks per day x 3 days). Even at a projected run time of 7.5 hours, 4 or 5 jobs does not even cover a half day of work.

The only adjustments I have made to the system have been via the Rosetta web site - during the last round of server problems I set my jobs to run at 12 hours just to stay "employed" - they were returned to 4 hours when the servers started functioning normally again. I have never done a manual edit of any Rosetta / BOINC file.

The under-populated work queue has been an issue for months. I am running the 6.10.56 version of BOINC for Linux. This is only a problem on a few of my systems and is no big deal as long as the Rosetta servers continue to run pretty much error free.

Thanks for any suggestions
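Running the arithmetic on those numbers (nothing BOINC-specific here, just a sanity check):

```python
# Sanity check using the numbers from "popeye" above.

cores = 4
run_time_pref = 14400        # 4-hour runtime preference, in seconds
buffer_days = 3.0            # requested work queue
dcf = 1.8885                 # Duration Correction Factor for Rosetta

# What the queue should hold if estimates matched the preference:
tasks_per_core_per_day = 24 * 3600 / run_time_pref       # 6.0
print(cores * tasks_per_core_per_day * buffer_days)      # 72.0 tasks

# The projected per-task run time once the DCF is applied:
corrected = run_time_pref * dcf                          # ~27,194 s
print(corrected / 3600)                                  # ~7.55 h -- matches the ~7:33 shown

# Even at that inflated estimate, a 3-day buffer should still hold:
print(cores * buffer_days * 24 * 3600 / corrected)       # ~38 tasks
```

So the DCF explains the 7:33 projection exactly, but by itself it should still leave around 38 tasks in the queue - not 4 or 5. Something else is limiting the fetch. |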
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
All I can offer is that I know they are still tinkering with the work fetch rules in the BOINC core client. Hopefully they get the additional buffer feature working better soon. Sometimes it seems to me everything is working well, then they make a bunch of changes, and then all it seems to understand is deadlines and being completely out of work.

Rosetta Moderator: Mod.Sense |
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
Chris Holvenstot said: Is there a simple way to have BOINC "forget" whatever parameters it uses in determining how many tasks to keep in the queue to satisfy the number of days of work to keep on hand?

Oh my god, you are over the 4 million mark already?! |
Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
OK - I tore into this a little and I think I see what is going on. I don't understand how things got that way, but on each of my systems which have a really short work queue, if you go to the file global_prefs_override.xml, you see the tag "work_buf_additional_days" set to 0.25. On the systems which are functioning normally this value is set to the expected 3.0 days.

I am not sure why this file is not being updated - the permissions are set to 644 and the owner is the same as what BOINC runs under. Despite the fact that I change my preferences from time to time using the website, the last time this file was updated was back in early August.

Although it is an "override" dataset, it is unclear to me how this value would be manually set - I haven't spotted the menu option for it yet, which likely means it is looking me right in the eye.

I backed up the file on one system and used vi to manually set the value to 2 - now in wait-and-see mode.
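For reference, the relevant piece of the file on an affected host looks like this (excerpt only - the other preference tags are omitted):

```xml
<!-- global_prefs_override.xml (excerpt). As the name suggests, while
     this file exists its values override the preferences sent down
     from the project website. -->
<global_preferences>
    <work_buf_additional_days>0.25</work_buf_additional_days>
</global_preferences>
```
 |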
Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Chile Man said: Oh my god, you are over the 4 million mark already?!

To be honest, I had not even noticed that until you mentioned it - medical treatments and work have kept me so busy of late that I have been running on autopilot. Haven't even really had time to pull "Biscuit Boy's" chain - which is always recreational. |
Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Transient said ... It is in the "Network Usage" tab, 3rd or 4th line on the right.

That was the solution. I'm not sure how those ever got set down to 0.25 days, but now all my systems are sitting fat, dumb, and happy. Thanks to all who made suggestions. |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,178,626 RAC: 3,201 |
Chile Man said: Hey, hey, HEY!!! Be nice or I will send my staph infection to you!!!

Actually I am over it now, but for a while they thought I had MRSA, which is just an old staph infection that happens to be resistant to an antibiotic! In the end it was not MRSA, but after 7 days of IVs every 12 hours and then 15 days of oral antibiotics, 2 different kinds for 5 of those days, they say I am good to go now! Between seemingly passing myself several times going back and forth to the doctor's office for IVs and getting my PCs set up with a new, for me, backup process, I too have been on autopilot for a few weeks!!

I built a Windows Home Server, but somewhere around the time I switched from Comcast to Verizon Fios I lost access to the backups on it. Backups would be happening, but I could not restore anything, and then I accidentally deleted a whole hard drive's worth of data - a 500 gig hard drive's worth of data!!! After a month of trying to get it back I gave up, removed all the hard drives from the machine except the C: drive, and built another machine for them and the backups! It still runs as a Server and crunches, but it no longer does ANY backups at all! I need the Server part because of the number of PCs I have here - 11 currently running. Somebody needs to be in charge, and if a Server is on the system it automatically takes on that role.

Anyway, along with that I needed to separate out my wife's PC in the backup process. She takes a ton of digital pictures and won't delete anything! So I bought two 2 TB drives and I will set them up as RAID 1, so I will have a backup of her backup. I have not bought the enclosure yet, but I have one in mind that is fast and local. In the end her machine will have 1.5 TB of storage and two 2 TB drives for backups. The other two 1 TB drives that came out of the Server went into the other machine I built, and all other machines will be backed up to them. About half are done already, but the drives that have a lot of my data on them are still not done yet. When I did one PC it took 36 hours over my network, so it is a time investment, and my RAC takes a hit when I do it!

I am glad you got your job queue back up and going like you like it! I prefer a shorter one - I have the initial cache set at 0.10 days and the additional set at 0.25 days, except on 2 PCs where it is set to 0.75 days. I weather most outages okay but do run out of work occasionally! |
Michael Gould Send message Joined: 3 Feb 10 Posts: 39 Credit: 15,438,423 RAC: 4,427 |
Chris Holvenstot said: Although it is an "override" dataset, it is unclear to me how this value would be manually set - I haven't spotted the menu option for it yet, which likely means it is looking me right in the eye.

Aha!! That's why I wasn't getting the amount of buffer I wanted! Man, the things you can learn lurking here! Thank you, transient. And thanks, Holvenstot, for asking the question. |