Content-Type: text/x-zim-wiki
Wiki-Format: zim 0.4
Creation-Date: 2011-12-24T12:26:39+08:00

====== writing-robust-shell-scripts ======
Created Saturday 24 December 2011
http://www.davidpashley.com/articles/writing-robust-shell-scripts.html

Many people hack together shell scripts quickly to **do simple tasks**, but these soon take on a life of their own. Unfortunately shell scripts are **full of subtle effects** which result in scripts failing in unusual ways. It's possible to write scripts which minimise these problems. In this article, I explain several techniques for writing robust bash scripts.

===== Use set -u =====

How often have you written a script that broke because a variable wasn't set? I know I have, many times.

chroot=$1
...
rm -rf $chroot/usr/share/doc 

If you ran the script above and accidentally forgot to give a parameter, you would have just deleted all of your system documentation rather than making a smaller chroot. So what can you do about it? Fortunately bash provides you with **set -u**, which will exit your script if you try to use __an uninitialised variable__. You can also use the slightly more readable** set -o nounset**.

david% bash /tmp/shrink-chroot.sh
$chroot=
david% bash -u /tmp/shrink-chroot.sh
/tmp/shrink-chroot.sh: line 3: $1: **unbound variable**
david% 

===== Use set -e =====

Every script you write should include** set -e at the top**. This tells bash that it should **exit the script if any statement returns a non-true return value**. The benefit of using -e is that it__ prevents errors snowballing into serious issues __when they could have been caught earlier. Again, for readability you may want to use **set -o errexit**.

Using -e gives you error checking for free. If you forget to check something, bash will do it or you. Unfortunately it means you __can't check $?__ as bash will** never get to** the checking code if it isn't zero. There are other constructs you could use:

command
if [ "$?"-ne 0]; then echo "command failed"; exit 1; fi 

could be replaced with

**command || { echo "command failed"; exit 1; }   #整个用&& ||连起来的各命令组成了一个整体，其退出值为最后一个**__执行__**的命令。**

or

if ! command; then echo "command failed"; exit 1; fi   #__整个控制结构是一个整体__，可以对这个整体进行输入\出重定向。

What if you have a command that returns non-zero or you are not interested in its return value? You can use **command || true**, or if you have a longer section of code, you can turn off the error checking, but I recommend you use this sparingly.

set +e
command1
command2
set -e 

On a slightly related note, by default bash takes __the error status of the last item__ in a pipeline, which may not be what you want. For example, **false | true** will be considered to have succeeded. If you would like this to fail, then you can use **set -o pipefail** to make it fail.

===== Program defensively - expect the unexpected =====

Your script** should take into account of the unexpected**, like files missing or directories not being created. There are several things you can do to prevent errors in these situations. For example, when you create a directory, if the parent directory doesn't exist, mkdir will return an error. If you add a __-p__ option then mkdir will create all the parent directories before creating the requested directory. Another example is rm. If you ask rm to delete a non-existent file, it will complain and your script will terminate. (You are using -e, right?) You can fix this by using __-f__, which will silently continue if the file didn't exist.

===== Be prepared for spaces in filenames =====

__Someone will always use spaces in filenames or command line arguments and you should keep this in mind__ when writing shell scripts. 

In particular you** should use quotes around variables**.

if [ $filename = "foo" ]; 

will fail if $filename contains a space. This can be fixed by using:

if [ **"$filename" **= "foo" ]; 

When using $@ variable, you should __always quote it__ or any arguments containing a space will be expanded in to separate words.

david% foo() { for i in $@; do echo $i; done }; foo bar "baz quux"
bar
baz
quux
david% foo() { for i in** "$@"**; do echo $i; done }; foo bar "baz quux"
bar
baz quux 

I can not think of a single place where you shouldn't use "$@" over $@, so when in doubt, use quotes.

If you use **find and xargs** together, you should use __-print0 to separate filenames(可能包括路径) __with a **null character **rather than __new lines__. You then need to use -0 with xargs.

david% touch "foo bar"
david% **find | xargs ls**
ls: ./foo: No such file or directory
ls: bar: No such file or directory
david% find __-print0__ | xargs -0 ls
./foo bar 

===== Setting traps(设置trap捕获信号后，除非明确exit，否则脚本继续执行) =====

Often you write scripts which fail and__ leave the filesystem in an inconsistent state__; things like lock files, temporary files or you've updated one file and there is an error updating the next file. It would be nice if you could fix these problems, either by deleting the lock files or by __rolling back to a known good state__ when your script suffers a problem. Fortunately bash provides a way to run a command or function when it receives a unix signal using the trap command.

**trap command signal [signal ...] **

There are many signals you can trap (you can get a list of them by running__ kill -l__), but for __cleaning up after problems__ there are only 3 we are interested in: INT, TERM and EXIT. You can also** reset traps back** to their default by using __-__ as the command.
Signal 	Description
**INT** 		Interrupt - This signal is sent when someone kills the script by pressing ctrl-c.
**TERM **	Terminate - this signal is sent when someone sends the TERM signal using the kill command.
**EXIT **	Exit - this is a **pseudo-signal** and is triggered when your **script exits**, either through reaching the** end of the script**, an** exit **command or by a command failing when using set -e.

Usually, when you write something using a lock file you would use something like:

if [ ! -e $lockfile ]; then
   touch $lockfile
   critical-section   #脚本在这个过程中退出时，lockfile就会留在系统中
   rm $lockfile
else
   echo "critical-section is already running"
fi

What happens if someone kills your script while critical-section is running? The lockfile will be **left there** and your script won't run again until it's been deleted. The fix is to use:

if [ ! -e $lockfile ]; then
   **trap "rm -f $lockfile; exit" INT TERM EXIT    #在捕获信号前需安装信号及其执行的命令**
   touch $lockfile
   critical-section
   rm $lockfile
  ** trap - INT TERM EXIT                                 #将信号的处理处理恢复到缺省状态(一般异常地终止脚本的继续执行)**
else
   echo "critical-section is already running"
fi

Now when you **kill the script** it will delete the lock file too. Notice that we __explicitly exit__ from the script at the end of trap command, otherwise the script will **resume from the point **that the signal was received.

===== Race conditions =====

It's worth pointing out that there is a slight race condition in the above lock example between the time we test for the lockfile and the time we create it. A possible solution to this is to use__ IO redirection and bash's noclobber mode, which won't redirect to an existing file__. We can use something similar to:

if ( __set -o noclobber__; echo "$$" > "$lockfile") 2> /dev/null; 
then
   trap 'rm -f "$lockfile"; exit $?' INT TERM EXIT

   critical-section
   
   rm -f "$lockfile"
   trap - INT TERM EXIT
else
   echo "Failed to acquire lockfile: $lockfile." 
   echo "Held by $(cat $lockfile)"
fi 

A slightly more complicated problem is where you need to update a bunch of files and need the script to__ fail gracefully__ if there is a problem in the middle of the update. You want to be certain that something either happened correctly or that it appears as though it **didn't happen at all**.Say you had a script to add users.

add_to_passwd $user
cp -a /etc/skel /home/$user
chown $user /home/$user -R

There could be problems if you ran out of diskspace or someone killed the process. In this case you'd want the user to not exist and all their files to be removed.

__rollback() __{
   del_from_passwd $user
   if [ -e /home/$user ]; then
      rm -rf /home/$user
   fi
  ** exit   #错误发生，脚本退出执行。**
}

**trap rollback INT TERM EXIT**
add_to_passwd $user
cp -a /etc/skel /home/$user
chown $user /home/$user -R
**trap - INT TERM EXIT**

We needed to remove the trap at the end or the rollback function would have been called as we exited, undoing all the script's hard work.

===== Be atomic =====

Sometimes you need to update a bunch of files in a directory at once, say you need to rewrite urls form one host to another on your website. You might write:

for file in $(find /var/www -type f -name "*.html"); do
   perl -pi -e 's/www.example.net/www.example.com/' $file
done

Now if there is a problem with the script you could have half the site referring to www.example.com and the rest referring to www.example.net. You could fix this using a backup and a trap, but you also have the problem that the site will be inconsistent during the upgrade too.

The solution to this is to __make the changes an (almost) atomic operation__. To do this make a copy of the data, **make the changes in the copy**, move the original out of the way and then move the copy back into place. You need to make sure that both the old and the new directories are moved to locations that are__ on the same partition__ so you can take advantage of the property of most unix filesystems that moving directories is very fast, as they only have to update the inode for that directory.

cp -a /var/www /var/www-**tmp**
for file in $(find /var/www-tmp -type f -name "*.html"); do
   perl -pi -e 's/www.example.net/www.example.com/' $file
done
__mv /var/www /var/www-old__
mv /var/www-tmp /var/www 
虽然上面几步的cp mv过程，有可能发生中断，但是可以保证，__在最坏的情况下，系统总有一份可以恢复的文件__。
This means that if there is a problem with the update, the live system is not affected. Also the time where it is affected is reduced to the time between the two mvs, which should be very minimal, as the filesystem just has to change two entries in the inodes rather than copying all the data around.

The disadvantage of this technique is that you need to use twice as much disk space and that any process that keeps files open for a long time will still have the old files open and not the new ones, so you would have to restart those processes if this is the case. In our example this isn't a problem as__ apache opens the files every request__. You can check for files with files open by using lsof. An advantage is that you now have a backup before you made your changes in case you need to revert.