File transfers.

Message boards : Number crunching : File transfers.

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 91662 - Posted: 8 Feb 2020, 9:19:26 UTC
Last modified: 8 Feb 2020, 9:20:36 UTC

I noticed yesterday a Rosetta task on my list in the "downloading" state. Some time later it was still downloading, so I went to Transfers and poked and prodded it: the download starts, but stops at 46.22%. Retry does the same. It is still like that today. Server status looks normal.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
LarryMajor
Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 91663 - Posted: 8 Feb 2020, 10:34:14 UTC

I'm having the same problem with two machines. It happens occasionally, but it's been bad the past 24 hours.
bfromcolo
Joined: 25 Apr 13
Posts: 2
Credit: 1,294,095
RAC: 0
Message 91665 - Posted: 8 Feb 2020, 20:22:27 UTC

I have had 3 tasks on 2 machines hung like this for hours, and these are very small downloads. To make matters worse, it stops other work from being downloaded, at least sometimes; it's not consistent here. Retrying the transfer didn't help with any of them. Aborting the transfer did help: it caused the associated work unit to fail, and on the next update everything was back in order.

Sat 08 Feb 2020 08:26:01 AM MST | Rosetta@home | Not requesting tasks: some download is stalled
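That log line describes a simple gate in the client: while any of a project's downloads is stalled, the client does not ask that project for new tasks, which is why aborting the stuck transfer unblocks everything. Below is a minimal Python model of that behaviour, assuming a single boolean "stalled" flag per download. The `Download`, `Project` and `should_request_work` names and the `input.zip` file name are hypothetical; this is a sketch, not BOINC's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Download:
    # Hypothetical record for one pending input-file transfer.
    name: str
    percent_done: float
    stalled: bool


@dataclass
class Project:
    name: str
    downloads: list[Download] = field(default_factory=list)


def should_request_work(project: Project) -> tuple[bool, str]:
    """Simplified model: refuse to fetch work while any download is stalled."""
    if any(d.stalled for d in project.downloads):
        return False, "Not requesting tasks: some download is stalled"
    return True, "Requesting tasks"


if __name__ == "__main__":
    rosetta = Project("Rosetta@home", [Download("input.zip", 46.22, stalled=True)])
    print(should_request_work(rosetta))  # blocked by the stalled file
    rosetta.downloads.clear()            # in this model, what aborting the transfer does
    print(should_request_work(rosetta))  # work fetch resumes
```

In the real client, aborting also fails the associated task, as described above; the model only captures the work-fetch side.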
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 91670 - Posted: 9 Feb 2020, 13:37:20 UTC

Still like that today. I aborted the transfer. Other jobs downloaded and started quickly.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 91671 - Posted: 9 Feb 2020, 20:35:37 UTC

Just a note to help others avoid having to 'abort transfer' (which also aborts the task, so it may never get completed, hurting the research): I've found that closing the BOINC client with the 'Stop running tasks when exiting the BOINC manager' checkbox ticked, then restarting it, force-retries the downloads, and they usually succeed.

Still, this is definitely a networking issue on the UW side. Hopefully someone reads this forum post.
**38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research
Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 25,992,337
RAC: 13,353
Message 91677 - Posted: 12 Feb 2020, 8:01:11 UTC

I have also had a few stuck files in the last few days.
And BOINC stopped getting new work from R@H completely until I noticed it today and aborted the stuck file transfers.
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 9,701
Message 91766 - Posted: 24 Feb 2020, 13:21:37 UTC

I've had this over the last few weeks - not entirely sure it's fixed even now.
The biggest issue is unattended machines left for longer than my overall buffer size, in my case 24-34 hours.
New tasks are blocked while a download is stalled (always a very small zip file), so once all the Rosetta tasks in my buffer are complete, the buffer is filled completely with tasks from my backup project instead.
Once the stalled file/task is manually aborted/cleared, my priorities between Rosetta and the backup project mean the backup tasks are all ignored unless they're manually forced to run, so there's a further day or two of clearing them out before the machine becomes unattended again, with the prospect of another failed Rosetta download, and everything repeats itself.
This has been a constant job almost every single day for the last two weeks, over 4 machines in 3 different locations, so if anyone can find a way of preventing this from recurring I'd really appreciate it. It's not been funny.
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 91768 - Posted: 24 Feb 2020, 14:42:07 UTC - in response to Message 91766.  
Last modified: 24 Feb 2020, 15:29:45 UTC

"Once the stalled file/task is manually aborted/cleared, my priorities between Rosetta and backup project mean backup tasks are all ignored unless they're manually forced to run, so there's a further day or two of clearing them out before the machine becomes unattended again with the prospect of another failed Rosetta download and everything repeats itself."

That is annoying, I know. But if you have set the backup as a zero resource share, it will eventually clear itself out in order to meet its expiration date. It will just sit around for a while.
Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 25,992,337
RAC: 13,353
Message 91814 - Posted: 1 Mar 2020, 2:24:50 UTC

Yes, it will clear itself, but not in a good way. BOINC will just ignore tasks from a project with a "zero" resource share until they are close to their deadlines; that triggers "panic mode", and BOINC reallocates all resources to them so they can finish before the deadline. But sometimes it still misses some deadlines, because task duration estimates are far from perfect and some WUs take much longer than BOINC thinks.
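As I understand it, "panic mode" means the client predicts a deadline miss and runs the endangered tasks earliest-deadline-first, planning with its runtime estimates. The toy model below (made-up task names, hours and deadlines; a simplification, not BOINC's real scheduler) shows how a plan that looks safe on estimates can still miss a deadline once a WU runs longer than predicted.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    runtime_h: float   # CPU hours the task needs
    deadline_h: float  # hours from now until the task is due


def edf_late_tasks(tasks: list[Task], cores: int = 1) -> list[str]:
    """Schedule tasks earliest-deadline-first on identical cores and
    return the names of tasks that would finish after their deadline."""
    free_at = [0.0] * cores  # hour at which each core next becomes idle
    late = []
    for task in sorted(tasks, key=lambda t: t.deadline_h):
        core = min(range(cores), key=lambda c: free_at[c])  # earliest-idle core
        free_at[core] += task.runtime_h
        if free_at[core] > task.deadline_h:
            late.append(task.name)
    return late


if __name__ == "__main__":
    estimated = [Task("wu_a", 10, 12), Task("wu_b", 10, 24)]
    actual = [Task("wu_a", 10, 12), Task("wu_b", 16, 24)]  # wu_b runs long
    print(edf_late_tasks(estimated))  # the plan looks safe
    print(edf_late_tasks(actual))     # the real run misses wu_b's deadline
```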
It also does other stupid things while in "panic mode", like ignoring the CPU core reservation setting (I set it to use at most 90% of CPUs = 7 of 8 cores, but BOINC in "panic mode" uses all 8), or pausing GPU work to free more CPU cores for CPU WUs, risking missed deadlines, and other things it was never allowed to do.
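For reference, the "90% = 7 of 8 cores" figure matches rounding the core count down: 8 × 0.9 = 7.2, truncated to 7. Here is a small sketch of that arithmetic, assuming the client truncates and never drops below one core. That rule is an assumption consistent with the numbers in this post, not something quoted from BOINC's source.

```python
def usable_cores(total_cores: int, max_cpu_pct: float) -> int:
    """Cores used under a 'use at most N% of the CPUs' limit.

    Assumed rule: truncate toward zero, but always allow at least one core.
    """
    return max(int(total_cores * max_cpu_pct / 100), 1)


if __name__ == "__main__":
    for cores, pct in [(8, 90), (4, 90), (2, 40)]:
        print(f"{cores} cores at {pct}% -> {usable_cores(cores, pct)} usable")
```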
Om
Joined: 18 Feb 20
Posts: 16
Credit: 777,076
RAC: 0
Message 91968 - Posted: 14 Mar 2020, 16:16:26 UTC - in response to Message 91662.  

March 14th and the issue continues. I have one stuck at 82.22%. Aborting seems to be the only option...
Dr Who Fan
Joined: 28 May 06
Posts: 70
Credit: 267,358
RAC: 374
Message 91972 - Posted: 14 Mar 2020, 19:37:32 UTC

This thread/topic is a duplicate of Message boards : Number crunching : Stalled downloads.
Let's not make multiple topics on the SAME issue!

Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 9,701
Message 91980 - Posted: 15 Mar 2020, 8:44:08 UTC - in response to Message 91768.  

"Once the stalled file/task is manually aborted/cleared, my priorities between Rosetta and backup project mean backup tasks are all ignored unless they're manually forced to run, so there's a further day or two of clearing them out before the machine becomes unattended again with the prospect of another failed Rosetta download and everything repeats itself."

"That is annoying, I know. But if you have set the backup as a zero resource share, it will eventually clear itself out in order to meet its expiration date. It will just sit around for a while."

I set it to 96.67% Rosetta to 3.33% WCG, but that's not the issue I'm seeing. Once all Rosetta tasks are complete, apart from the Rosetta task with the stalled download, my entire buffer fills with the backup project, so I get 2.0 or 2.4 days of WCG tasks.
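A 96.67% to 3.33% split is a 29:1 ratio of resource shares, for example 290 and 10 (illustrative values; only the ratio matters). The sketch below turns share values into fractions and, under the simplifying assumption that the long-run split were enforced strictly, estimates how long a WCG backlog would take to drain. That is one way to see why the backup tasks can sit untouched until their deadlines approach. The function names are hypothetical.

```python
def share_fractions(shares: dict[str, float]) -> dict[str, float]:
    """Convert resource share values into fractions of machine time."""
    total = sum(shares.values())
    return {project: share / total for project, share in shares.items()}


def days_to_drain(backlog_days: float, fraction: float) -> float:
    """Days to finish a backlog measured in whole-machine days if the project
    only ever received `fraction` of the machine (a strict-split simplification)."""
    return backlog_days / fraction


if __name__ == "__main__":
    split = share_fractions({"Rosetta@home": 290, "WCG": 10})  # hypothetical share values
    for project, frac in split.items():
        print(f"{project}: {frac:.2%}")
    print(f"2.4 days of WCG work: ~{days_to_drain(2.4, split['WCG']):.0f} days at that split")
```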

When I resolve the Rosetta issue, I can manually force the WCG tasks to run (4 or 8 tasks at a time, depending on the cores for that machine) but as soon as they finish, Rosetta starts again and I have to manually start more WCG tasks. It's very boring as well as annoying. And when I'm at that location, I'm in one of two places for half a day at a time, so it can take 2 or 3 days to clear them or, as has just been the case, I don't get to clear them all in 3 days and have to leave for my other location for 3-4 days.

I could just abort all the WCG tasks, I suppose, but I don't like to do that. If they run, then I'm sure of a long unattended run on Rosetta to catch up the debt. Which is great unless another Rosetta download fails and then I'm back to square one, resolving a task that's failed while unattended there.

This has been going on for nearly a month. To say I'm thoroughly sick and tired of it all would be an understatement.

©2024 University of Washington
https://www.bakerlab.org