Opinions on error recovery?

H. Peter Anvin hpa@zytor.com
Mon, 29 Oct 2001 18:19:03 -0800

Hello people... I would really like people's opinions on how LPSM should
behave when checkpointing fails (disk full, for example).  Checkpointing
failure is detected asynchronously, so returning an error in the obvious
way isn't possible.

Currently, LPSM sends SIGABRT to the master process; a bit crude, and not
safe if another checkpoint process has since been started.

It seems there are only a small handful of options:

a) Send a signal to the main process; make sure interlocks are in place so
that any further checkpoint processes are terminated without writing any
commit records.

This minimizes the processing that can actually take place without
anywhere to put the data.  It also makes sure that on restart we will roll
back to the last consistent state we could actually store on the backup.
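
A minimal sketch of the interlock in (a), under stated assumptions.  None
of these names come from LPSM: struct ckpt_shared, ckpt_report_failure(),
ckpt_try_commit() and install_failure_handler() are hypothetical, the
commit counter stands in for the real log, and the struct is assumed to
live in memory shared between the main and checkpoint processes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical state shared by the main and checkpoint processes. */
struct ckpt_shared {
    atomic_bool failed;   /* set once, by the first failing checkpoint */
    atomic_int  commits;  /* stand-in for commit records written */
};

/* Set by the main process's handler; only async-signal-safe work here. */
static volatile sig_atomic_t ckpt_failure_seen = 0;

static void on_ckpt_failure(int signo)
{
    (void)signo;
    ckpt_failure_seen = 1;
}

/* Main process: install the failure handler for the chosen signal. */
int install_failure_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_ckpt_failure;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Checkpoint process: report a failed write.  Only the first reporter
 * trips the interlock and signals the main process; returns true then. */
bool ckpt_report_failure(struct ckpt_shared *s, pid_t main_pid, int signo)
{
    bool expected = false;
    assert(s != NULL);
    if (!atomic_compare_exchange_strong(&s->failed, &expected, true))
        return false;               /* someone else already reported */
    (void)kill(main_pid, signo);
    return true;
}

/* Checkpoint process: refuse to commit once any checkpoint has failed.
 * Note the check and the commit are not one atomic step here; a real
 * interlock would need to hold a lock across the commit write. */
bool ckpt_try_commit(struct ckpt_shared *s)
{
    assert(s != NULL);
    if (atomic_load(&s->failed))
        return false;
    atomic_fetch_add(&s->commits, 1);
    return true;
}
```

The main process would call install_failure_handler() once at startup and
pass its own getpid() to the checkpoint processes it starts.  Using a
catchable signal rather than SIGABRT is a choice made for this sketch only.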

b) Reconstruct the dirty mask and keep running, hoping that a future
checkpoint will succeed.

This requires some pretty complicated tracking in the main process of what
has actually been committed by which process, but that's not too bad to
deal with.  It also means that we can continue running uninterrupted and,
after manual intervention to clear the problem, checkpoint all data.

The problem is that it allows for an unbounded amount of computation to
happen without a backing store.
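
A rough sketch of the dirty-mask reconstruction in (b), assuming a simple
one-bit-per-page mask.  The names (struct dirty_mask, ckpt_begin(),
ckpt_failed()) and the arena size NPAGES are made up for illustration and
are not LPSM's.  It handles a single in-flight checkpoint and ignores the
harder part, tracking which of several overlapping checkpoint processes
committed what.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NPAGES     256              /* hypothetical arena size, in pages */
#define MASK_WORDS (NPAGES / 64)

struct dirty_mask {
    uint64_t w[MASK_WORDS];         /* one bit per page */
};

void mask_set(struct dirty_mask *m, size_t page)
{
    assert(page < NPAGES);
    m->w[page / 64] |= UINT64_C(1) << (page % 64);
}

int mask_test(const struct dirty_mask *m, size_t page)
{
    assert(page < NPAGES);
    return (int)((m->w[page / 64] >> (page % 64)) & 1);
}

/* Start of a checkpoint: hand the current dirty set to the checkpoint
 * and start collecting new dirty pages in an empty live mask. */
void ckpt_begin(struct dirty_mask *live, struct dirty_mask *in_flight)
{
    *in_flight = *live;
    memset(live, 0, sizeof *live);
}

/* Checkpoint failed: nothing in in_flight reached stable storage, so
 * merge it back.  Pages dirtied since ckpt_begin() stay dirty too. */
void ckpt_failed(struct dirty_mask *live, const struct dirty_mask *in_flight)
{
    for (size_t i = 0; i < MASK_WORDS; i++)
        live->w[i] |= in_flight->w[i];
}
```

After ckpt_failed(), the live mask covers everything a later checkpoint
has to write, which is what lets the main process keep running.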

I would appreciate hearing what people think.