I have a Pi3B which has a SSD mounted on it. When I try to rsync parts of the SD card to another PI (with a disc mounted on it), it will work for a short period of time (4 days - one rsync/day), then fails. When it fails, it syncs some files, then hangs. The session eventually stops when the SSH link times out. If umount then mount the SSD, things are back to normal. Access to the SSD seems to be OK.
Unmounting and remounting the SSD before the rsync operation seems to be a bodge to get around this problem, but I haven't been able to find any suggestions online as to what the problem might be.
A bit of background. This PI is a replacement for one which failed, new Pi, new SD card, and a new disk (the old one wouldn't mount on the PI, but would on my Ubuntu PC). The script is the same one that I was using with the old config, and that worked happily for several years.
Adrian
--
To Reply :
replace "bulleid" with "adrian" - all mail to bulleid is rejected
Sorry for the rigmarole, If I want spam, I'll go to the shops
Every time someone says "I don't believe in trolls", another one dies.
Ethernet all the way. Source Pi is connected to a switch, that switch is connected to another switch, and the target Pi is connected to the second switch.
Adrian
--
To Reply :
replace "bulleid" with "adrian" - all mail to bulleid is rejected
Sorry for the rigmarole, If I want spam, I'll go to the shops
Every time someone says "I don't believe in trolls", another one dies.
That is the odd bit. When run as a cron job, the log file (/tmp/rsync.log) shows the first few files being sent, and that is all. If I run the command from the command line, I don't get any errors from rsync, all I get is a message about the SSH link timing out (sorry, I don't have a copy of the message).
/var/log/syslog shows the cron job starting, but there is nothing there to suggest that it failed.
Thanks
Adrian
--
To Reply :
replace "bulleid" with "adrian" - all mail to bulleid is rejected
Sorry for the rigmarole, If I want spam, I'll go to the shops
Every time someone says "I don't believe in trolls", another one dies.
The apparent difference in error output is because you?re only capturing stdout to rsync.log, not stderr. The redirection operators are in the wrong order if that?s not what you wanted.
That will have to wait for it to fail again (it is too far up my terminal screen for the buffer)
Now fixed. It might throw up an error tomorrow, or it could be a few days before it happens again.
Thanks
Adrian
--
To Reply :
replace "bulleid" with "adrian" - all mail to bulleid is rejected
Sorry for the rigmarole, If I want spam, I'll go to the shops
Every time someone says "I don't believe in trolls", another one dies.
sent 4,058,402 bytes received 75,386 bytes 56,242.01 bytes/sec total size is 4,119,794,435 speedup is 996.61
A bit of digging online throws up several similar errors, but they seem to be "this has never worked" rather than an intermittent problem (which is what I have). There is also a suggestion that the problem might be incompatible versions of rsync (in which case why does the push sometimes work, and a pull operation always work). Host is 3.1.2, target is 3.0.9).
Adrian
--
To Reply :
replace "bulleid" with "adrian" - all mail to bulleid is rejected
Sorry for the rigmarole, If I want spam, I'll go to the shops
Every time someone says "I don't believe in trolls", another one dies.
packet_write_wait suggests that (from the point of view of the local SSH endpoint) the remote SSH endpoint terminated. So, logs on that system would be worth a look.
What?s the basis for saying there?s a timeout involved?
Looking at the target Pi, there is nothing of note in either syslog or messages, however, auth.log has the following :
04:05:02 sshd[13415]: Nasty PTR record "192.168.1.18" is set up for
192.168.1.18, ignoring
04:05:02 sshd[13145]: Accepted publickey for root from 192.168.1.18 port
49632 ssh2
04:05:02 sshd[13145]: pam_unix(sshd:session): session opened for user root by (uid=0)
04:06:14 sshd[13145]: Received disconnect from 192.168.1.18: 11: disconnected by user
04:06:14 sshd[13145]: pam_unix(sshd:session): session closed for user root
However, the script which runs the rsync then emails me the log, and the email is timed at 04:21, so something was "hanging" for about 15 minutes, which is what usually happens when it fails.
I've just rerun the cron job (altered the scheduled time), and added a couple of date commands to the script, one at the start, and one after the rsync command, and I get this :
Sun 31 May 12:50:01 UTC 2020
sending incremental file list etc/fake-hwclock.data home/pi/.bash_history home/pi/.cache/lxsession/LXDE-pi/run.log home/pi/stats/temperature_log_200531.csv root/bin/ root/bin/rsync_script.sh packet_write_wait: Connection to 192.168.1.118 port 22: Broken pipe rsync: connection unexpectedly closed (58154 bytes received so far) [sender] rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.2] Sun 31 May 13:07:00 UTC 2020
Running watch "ps -ef | grep rsync" on both source and target machines, the source machine lost its rsync processes at 13:07, but the target machines were still running 20 minutes later.
When this first happened, I thought I saw a message mentioning a time out, but I'm not seeing one now.
Adrian
--
To Reply :
replace "bulleid" with "adrian" - all mail to bulleid is rejected
Sorry for the rigmarole, If I want spam, I'll go to the shops
Every time someone says "I don't believe in trolls", another one dies.
The connection to the target machine eventually closed after 121 minutes. "session closed for user root"
Running the rsync command from the command line also saw it fail to complete (target rsync session still running), however, repeating the exercise, this time with redirect to the log file saw completion within a couple of minutes, and the target sessions for the final run also terminated, but those for the second run are still present.
Run 3 has 5 entries in the auth.log file, three at the start, and two at the end : Received disconnect from 192.168.1.18: 11: disconnected by user pam_unix(sshd:session): session closed for user root
The first of those two hasn't appeared on those runs that failed, even after the local sessions die (for whatever reason).
Adrian
--
To Reply :
replace "bulleid" with "adrian" - all mail to bulleid is rejected
Sorry for the rigmarole, If I want spam, I'll go to the shops
Every time someone says "I don't believe in trolls", another one dies.
That looks like a normal SSH disconnect[1]. There?s quite a few reasons the OpenSSH client might send that with the most obvious being rsync closing the pipe to the SSH client.
There?s an apparent inconsistency here, assuming that these log traces reflect the same session:
The client reports that server has closed the TCP session when it still had something to write.
The server thinks the client has sent it a disconnect in the SSH session.
The former could be faked by an intermediate router but I think you said there wasn?t one, and the latter is not practical to fake.
Are you familiar with strace? At this point that?s definitely the tool I?d reach for.
The LAN has a router (which also acts as the modem and handles local DHCP etc), but physically, it is beyond the section of network that the source and target are on. Whether packets take the direct route or the scenic route, I don't know.
I've heard of it, but I haven't (yet) had a reason to use it.
Adrian
--
To Reply :
replace "bulleid" with "adrian" - all mail to bulleid is rejected
Sorry for the rigmarole, If I want spam, I'll go to the shops
Every time someone says "I don't believe in trolls", another one dies.
FWIW, I see something fairly similar from time to time but its fairly rare - I can go months without seeing it, and then I get one glitch.
My setup:
Hardwired network, all machines (two Lenovo laptops and an ancient dual Athlon box, all running 64bit Fedora plus one series 2 RPi running Raspbian Buster) are connected to a single D-link switch. My ADSL Modem/ router is also connected to the switch along with a M402dne Laserjet.
Backups are to a cycle of two of WD Essentials 1TB USB drives, stored offline and connected to the dual Athlon box when I make a weekly backup of all four systems immediately before a weekly software update. I use rsync to do the backups, controlling and monitoring the process from a SSH login on the Athlon box.
When a backup is in progress I see a progress report (i.e. files being updated on the USB drives scroll up the console window that's logged in to the Athlon. Sending copies of rsync are launched remotely from input the the login session on the Athlon.
If the backup breaks its always because the sending copy of rsync that has failed: I notice that the screen on the receiving end is no longer scrolling up and, if I login to the sending system its always its rsync instance that's crashed.
I learned long ago that this type of rsync backup arrangement is always restartable without data loss (because the file that was being backed up is the first to be backed up again when the run is restarted), so I recover by killing the receiving copy of rsync (this forces a cleanup at the receiving end) and then restarting that backup. The only loss is of time - the time it takes to notice that a backup has failed, clean up and restart it. Because recovery from such an infrequent failure is very fast, I will admit to not investigating the causes in any detail.
If mine was occurring once in a blue moon, I might be inclined to ignore it (or tweak the script to automatically rerun), but it seems to be every 2-3 days (it failed again overnight), which is annoying me.
Adrian
--
To Reply :
replace "bulleid" with "adrian" - all mail to bulleid is rejected
Sorry for the rigmarole, If I want spam, I'll go to the shops
Every time someone says "I don't believe in trolls", another one dies.
In the meantime, the target is churning out line after line of :
select(4, [3], [], NULL, {60, 0}) = 0 (Timeout)
Adrian
--
To Reply :
replace "bulleid" with "adrian" - all mail to bulleid is rejected
Sorry for the rigmarole, If I want spam, I'll go to the shops
Every time someone says "I don't believe in trolls", another one dies.
I didn't read that :) but my guess would certainly be that it has nothing to do with rsync and everything with a lost connection/route. Maybe a flaky ethernet cable or jack? It could happen. Try & set up a ping log from both to both and to the router.
Fair comment. As much as anything I wanted to flag up the point that, if an rsync backup fails, it is quick and easy to restart and, because of the way that files are selected for replication, the chances of data loss are fairly small - but you're probably already aware of this.
I've already got a ping log (cron job, every minute) going from each machine to the router, and they are not reporting any drops, and are usually ~0.4mS. The source machine uses a switch which has two other Pis attached to it, and they are also pinging away every minute without problem. The scripts running he pings have a delay in them, so each one pings at a different time (assuming that the clocks are precisely synced).
Adrian
--
To Reply :
replace "bulleid" with "adrian" - all mail to bulleid is rejected
Sorry for the rigmarole, If I want spam, I'll go to the shops
Every time someone says "I don't believe in trolls", another one dies.
ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here.
All logos and trade names are the property of their respective owners.