[RndTbl] sshd: Corrupted MAC on input.
Gilbert E. Detillieux
gedetil at cs.umanitoba.ca
Thu Jul 30 10:07:20 CDT 2020
On 2020-07-29 8:31 p.m., Trevor Cordes wrote:
> On 2020-07-29 Gilbert E. Detillieux wrote:
>> What's the likely cause of this? A bad NIC? Bad RAM? (I'm guessing
>> something is corrupting the packets once in a while, but I'm not sure
>> what. If so, it seems to get past TCP's error correcting.)
>
> I would try the same type of transfer using a different client to the
> same server. Then try a different server for the same client. If you
> can get the same behavior with a different server, that would be
> extremely useful.
This is from a local backup server to an off-site backup. I can easily
try a different local server, but won't be able to exactly replicate the
rsync, though I can try with other large file(s). As for a different
remote destination, that's not easily replicated, but I'd at least know
if the problem is limited to the off-site data path and/or server.
> You could also try using nc from /dev/zero from the server to the
> client into a file, then use a script (or something) to check if the
> file is all zeros.
A script? Just using "od" would tell me that. :)
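For a transfer too large to eyeball, the all-zeros check can also report where the corruption sits and what the bad bytes look like. The following is only a minimal Python sketch of that idea; the function name, chunk size, and hit limit are illustrative assumptions, not anything taken from this thread.

```python
def find_nonzero_bytes(path, chunk_size=1 << 20, limit=20):
    """Return up to `limit` (offset, value) pairs for bytes that are not zero."""
    hits = []
    offset = 0
    with open(path, "rb") as f:
        while len(hits) < limit:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Fast path: skip the per-byte scan when the whole chunk is zero.
            if chunk.count(0) != len(chunk):
                for i, value in enumerate(chunk):
                    if value != 0:
                        hits.append((offset + i, value))
                        if len(hits) >= limit:
                            break
            offset += len(chunk)
    return hits
```

An empty list means the received file is all zeros; otherwise each pair gives the file offset and the value of a corrupted byte, which shows whether the damage is a single flipped bit or a longer run.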
> It would be neat to see the actual corruption that
> occurs. Make sure nc is using TCP (though UDP would be an interesting
> test as well, but not critical or required).
>
> You're right that TCP shouldn't really allow such (line) errors to get
> through to the ssh layer.
TCP checksums aren't perfect, and with very large transfers, there is a
non-negligible probability of errors getting through if the
underlying layers aren't doing their job. (Normally, the Ethernet frame
check (CRC) is more likely to weed out bad packets than the TCP checksum,
but I remember that in the days of PPP over dial-up, TCP checksums were
often inadequate. If something in the Ethernet data path is letting bad
packets through, sshd could be seeing errors that TCP misses.)
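To illustrate why the TCP checksum can miss some corruption: it is a 16-bit ones' complement sum, so any change that leaves that sum unchanged goes undetected. The sketch below is a simplified illustration in the style of RFC 1071. It checksums only a payload (a real TCP checksum also covers the TCP header and a pseudo-header), and the function and variable names are assumptions made for this example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = bytes.fromhex("0001f203f4f5f6f7")
# Corruption 1: the first two 16-bit words arrive in swapped order.
reordered = bytes.fromhex("f2030001f4f5f6f7")
# Corruption 2: one word is off by +1 and another by -1.
compensated = bytes.fromhex("0002f202f4f5f6f7")

for label, payload in [("original", original),
                       ("reordered", reordered),
                       ("compensated", compensated)]:
    print(f"{label:12s} {internet_checksum(payload):#06x}")
```

All three payloads produce the same checksum even though two of them differ from the original. The MAC that ssh computes over each packet is a cryptographic hash, so it does notice such changes, which fits with sshd reporting a corrupted MAC on data that TCP accepted.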
> If your NIC has TCP checksum offloading, try turning it off (ethtool is
> what I used to use for that, not sure if it's still "the way"). That
> will eliminate the NIC and bus from the equation, leaving you with
> RAM/CPU and/or mobo between the two (but not out to the cards/bridge).
>
> If you turn off offloading and the problem goes away, your transfer
> performance should tank because it'll be doing TCP retries each time.
Good suggestion. This is an onboard Intel NIC, and on another server, I
had to do this...
# Prevent Intel e1000e hangs/resets due to buggy GSO, GRO and TSO.
# As suggested here...
# https://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang
ethtool -K em1 gso off gro off tso off
It's a different chipset here, and I'm not seeing this specific error,
but it could be something chipset-related anyway.
> My guess, as always, is... wait for it... bad caps on the board, likely
> near the NIC slot, or, if onboard, near the NIC onboard chip. I've had
> weird NIC behavior before and it's always turned out to be the caps
> near the card slot, usually 1000 uF little jobbers.
>
> I just decommissioned my main workstation I used since 2008(!) that was
> starting to get occasional VGA lockups, and lo and behold, the caps
> near the slots were just starting to get puffy (on a very high end
> Intel board). I'll be repairing them soon to repurpose the system.
>
> P.S. If a repair or replacement isn't possible for a while, sometimes
> moving the NIC as far away from the puffiest caps can help for a while
> until more caps go bad. Each 1 or 2 slots usually gets its own cap(s).
> Also, putting in a junkier NIC might help if it draws less power.
> These cap problems are always exacerbated by higher (transient/peak)
> power draws.
I had thought of just putting in a network card, and disabling the
onboard NIC, but I didn't want to do that until I was sure it was the
NIC and not something software related or MB related. And since this is
an off-site system (albeit still on campus), I have to coordinate with
someone else who's normally working from home these days.
So, looking for things I can test remotely, at the moment...
> Keep us posted!
Will do.
Gilbert
--
Gilbert E. Detillieux E-mail: <gedetil at cs.umanitoba.ca>
Dept. of Computer Science Web: http://www.cs.umanitoba.ca/~gedetil/
University of Manitoba Phone: (204)474-8161
Winnipeg MB CANADA R3T 2N2 Fax: (204)474-7609