Something decided to bork itself...

fastgeek
[H]ard|DCOTM x4 aka "That Company"
Joined: Jun 6, 2000
Messages: 6,520
So this system has been running great for the last five days or so. Checked it this morning and saw the following error. Tried the obvious stuff first: deleted queue.dat and the work dir, no good. Deleted those plus machinedependent.dat, no good. All of the above plus deleting the core and uninstalling the Kraken = no good. Re-wrapped with the Kraken, no good. Have also rebooted. System is running fine otherwise. :rolleyes:
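For anyone following along, the wipe sequence above can be sketched in shell. This is a scratch-directory demo so it's safe to run as-is; on a real install, FAHDIR would be the client's launch directory (here /home/carbon/fah), and you'd stop the client first since this discards any in-progress work.

```shell
# Scratch-directory demo of the state wipe; FAHDIR and the dummy
# files are stand-ins for a real v6 client install.
FAHDIR=$(mktemp -d)
mkdir -p "$FAHDIR/work"
touch "$FAHDIR/queue.dat" "$FAHDIR/machinedependent.dat" "$FAHDIR/work/wudata_01.dat"

rm -f  "$FAHDIR/queue.dat"              # 1) drop the queue
rm -rf "$FAHDIR/work"                   # 2) drop the work directory
rm -f  "$FAHDIR/machinedependent.dat"   # 3) force a fresh machine identity

ls -A "$FAHDIR"   # prints nothing: the wipe is complete
```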

Will keep messing around with it, but if someone has seen this before and knows the fix, please let me know.

BTW, I looked up the error code at this site, which says "These errors indicate an I/O hardware problem or perhaps an AV program preventing FAH from writing/reading certain work files." Thing is, there isn't any anti-virus on this machine, and Ubuntu isn't bitching about anything. It keeps telling me updates are available, but I've been ignoring those for days.

Code:
# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/carbon/fah
Executable: ./fah6
Arguments: -smp -bigadv 

[17:20:44] - Ask before connecting: No
[17:20:44] - User name: fastgeek (Team 33)
[17:20:44] - User ID: 752CF617297A36EA
[17:20:44] - Machine ID: 1
[17:20:44] 
[17:20:44] Loaded queue successfully.
[17:20:44] - Preparing to get new work unit...
[17:20:44] Cleaning up work directory
[17:20:44] + Attempting to get work packet
[17:20:44] Passkey found
[17:20:44] - Connecting to assignment server
[17:23:54] - Couldn't send HTTP request to server
[17:23:54] + Could not connect to Assignment Server
[17:23:54] - Successful: assigned to (130.237.232.141).
[17:23:54] + News From Folding@Home: Welcome to Folding@Home
[17:23:54] Loaded queue successfully.
[17:23:54] + Closed connections
[17:23:54] 
[17:23:54] + Processing work unit
[17:23:54] Core required: FahCore_a5.exe
[17:23:54] Core found.
[17:23:54] Working on queue slot 02 [May 23 17:23:54 UTC]
[17:23:54] + Working ...
thekraken: The Kraken 0.7-pre11 (compiled Mon May 21 13:33:37 PDT 2012 by carbon@carbon-R820P)
thekraken: Processor affinity wrapper for Folding@Home
thekraken: The Kraken comes with ABSOLUTELY NO WARRANTY; licensed under GPLv2
thekraken: PID: 2555
thekraken: Logging to thekraken.log
[17:23:54] 
[17:23:54] *------------------------------*
[17:23:54] Folding@Home Gromacs SMP Core
[17:23:54] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[17:23:54] 
[17:23:54] Preparing to commence simulation
[17:23:54] - Looking at optimizations...
[17:23:54] - Created dyn
[17:23:54] - Files status OK
[17:23:54] Couldn't Decompress
[17:23:54] Called DecompressByteArray: compressed_data_size=0 data_size=0, decompressed_data_size=0 diff=0
[17:23:54] -Error: Couldn't update checksum variables
[17:23:54] Error: Could not open work file
[17:23:54] 
[17:23:54] Folding@home Core Shutdown: FILE_IO_ERROR
[17:23:55] CoreStatus = 75 (117)
[17:23:55] Error opening or reading from a file.
[17:23:55] Deleting current work unit & continuing...
 
FWIW I turned off -bigadv and it is working. Perhaps the server is giving out bad BA WUs?
 
This looks like the 512-byte payload issue. It can be verified by adding the -verbosity 9 argument and looking for the following line in the log:

[20:37:17] Initial: 0000; - Receiving payload (expected size: 512)

If that turns out to be the case, it's likely PG will have to manually kill it.
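A quick way to check for that line is a grep over the client log. The sample line below is seeded into a temp file so this runs standalone; in practice you'd point LOG at the client's log file (FAHlog.txt is the v6 default name, though your path may differ).

```shell
# Detect the 512-byte payload symptom; the temp file stands in for
# the real FAHlog.txt in the client's launch directory.
LOG=$(mktemp)
echo '[20:37:17] Initial: 0000; - Receiving payload (expected size: 512)' > "$LOG"

if grep -q 'Receiving payload (expected size: 512)' "$LOG"; then
    verdict="bad 512-byte payload"
else
    verdict="payload size looks normal"
fi
echo "$verdict"
```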
 
Amaruk,

Have added that flag and bigadv. Stopped and restarted the client; will report back after this standard smp unit is done.

Tear,

I see you used "borked" too; Swedish Chef? Thing is, I did everything mentioned in there (delete queue, machine and work) but it still didn't do the trick. :( Will just have to see what comes up in a little bit.
 
fastgeek said:
Thing is, I did everything mentioned in there (delete queue, machine and work) but it still didn't do the trick. :(
I've had the same experience; sometimes it's just the luck of the draw, getting the same bad WU even after changing the machine ID.
I've also briefly switched to A3 projects in the past to work around it; hopefully that works for you too.
 
That means you're just unlucky -- repeat until you get a good unit...

Did you add -verbosity 9 and actually see 512-byte payload size?
 
Yes, I just got a chance to check the system after a nearly 2hr meeting; that little 512-byte message did indeed pop up. Have moved it back to -smp for now and will try again tomorrow.
 
No, no... don't wait. Wipe them again.

The contents of machinedependent.dat are the identifying information
for the servers that give out WUs.

The moment a new machinedependent.dat is generated, the server will
think it's a new machine.

If it still decides to give out a "bad WU", it's just bad luck...
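That wipe-and-retry cycle can be written as a loop. This sketch runs against a scratch directory with a seeded bad log line, so it's safe to execute; FAHDIR, the FILE_IO_ERROR grep, and the retry cap are all stand-ins for however you drive your real install.

```shell
# Wipe-and-retry loop: keep regenerating machine identity until the
# log no longer shows a bad WU, up to 5 attempts. Scratch-dir demo.
FAHDIR=$(mktemp -d)
echo '[17:23:55] Folding@home Core Shutdown: FILE_IO_ERROR' > "$FAHDIR/FAHlog.txt"

got_bad_wu() { grep -q 'FILE_IO_ERROR' "$FAHDIR/FAHlog.txt"; }

attempts=0
while got_bad_wu && [ "$attempts" -lt 5 ]; do
    attempts=$((attempts + 1))
    rm -f  "$FAHDIR/queue.dat" "$FAHDIR/machinedependent.dat"
    rm -rf "$FAHDIR/work"
    # ...restart the client and let it fetch a fresh WU here...
    : > "$FAHDIR/FAHlog.txt"   # stand-in: pretend the refetch was clean
done
echo "clean after $attempts wipe(s)"
```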
 
OK, this is odd. For whatever reason, all three of my bigadv systems have been "hit" with this problem. Finally got the big guy running again after a few more tries (did let the regular smp finish since I was going to be around).

Since I was here and the other 2P and 4P were finishing up I kept an eye on them; both systems had the exact same problem!

The 4P is up and running again, just waiting to see if this 2P will get a WU again or not. (It didn't; giving it one more go before it runs smp for the night)

Sure hope I don't have to constantly babysit these things. :(

*edit* Still getting the same error on that 2P system; so off to regular SMP land it goes. Ah well. */edit*
 
Posted by kasson HERE last November regarding 130.237.232.141

kasson said:
We had a downtime a little while ago; I think some jobs got introduced into the server system that don't have corresponding work unit files. I think that's what causes the FILE_IO_ERROR on start. Unfortunately, I haven't figured out a reliable way to clear the jobs. Fortunately, the server itself clears the jobs after trying them a few times. But it can cause these "false starts" that you see.
Hopefully it's cleared itself up by morning.
 
Well, my "big guy" spent six hours overnight trying to get a non-corrupted WU. Have done the usual delete q/m/w and restarted. That's almost a complete WU's worth of time wasted. :( Let's hope it's cleared up now. :)

*edit* Still all screwed up, and several more hours wasted. Have moved them to SMP for now. Better to be doing something than to keep getting these malformed WUs.
 
Tried again, still getting junk.

Dear FAH servers,

Was it something I said?

Love,
fastgeek

*edit*
However, it's kind of amusing seeing this... ;)
[01:57:23] Unit 3 finished with 99 percent of time to deadline remaining.
[01:57:23] Updated performance fraction: 0.991432
[01:57:23] Sending work to server
[01:57:23] Project: 6098 (Run 9, Clone 38, Gen 180)

*edit 2*
Got one at last. It's "only" a 6900, but I'll take it. Now to hope that when it finishes in 7.5 hrs it gets another one. Damn, I wish I could remote into these things from outside work!
 
Realize this is turning into my own personal little bitch-fest thread... but...

DAMN IT, STANFORD / PANDE GROUP / FAH! WHY NOW?!

With the big push to hit 900M, and this being the end of my ability to use these servers (as of Tuesday morning, for a while at least; maybe longer), it really f'ing sucks to be having this problem. Got two BA WUs yesterday, and they're done, but now every attempt on all three boxes = NFG. So, sorry team: all I can give is regular SMPs with 99% of the deadline remaining vs. BAs with 95% (on at least one box).

On the grand scale of things this is nothing; but am competitive and like to get all I can out of things. So sue me. :p
 
cuz you turned off -bigadv option ....

Were you trying to be sarcastic?

With bigadv I get corrupted WUs. Or maybe, after deleting the work dir + queue + machinedependent, I'll get lucky and get one or two before the corruption starts again. Thought all of this was pretty clear.

Since I won't be around to babysit these three machines, I've turned off BA; so yes, they're just doing regular SMP. Having a system that can do 1M PPD by itself only able to do non-BA WUs kind of sucks. :p
 
I got the 512-byte WUs on one of my machines last night. It lasted about a half hour before getting a good WU, without my doing anything. I emailed Peter Kasson this morning and he said he cleared out the bad WUs again. We'll see.
 