Opinions on error recovery?

Christopher Quinn cq@htec.demon.co.uk
Tue, 18 Dec 2001 18:54:23 +0000

 > Hello people... I would really like people's opinion on what the behaviour
 > of LPSM on checkpointing failure (disk full, for example) should be.
 > Checkpointing failure is detected asynchronously, so returning an error
 > the obvious way isn't possible.
 > Currently, LPSM sends SIGABRT to the master process; a bit crude, and not
 > safe if another checkpoint process has since been started.
 > It seems there are only a small handful of options:
 > a) Send a signal to the main process; make sure interlocks are in place so
 > that any further checkpoint processes are terminated without writing any
 > commit records.
 > This minimizes the processing that can actually take place without
 > anywhere to put the data.  It also makes sure that on restart we will roll
 > back to the last consistent state we could actually store on the backup
 > medium.
 > b) Reconstruct the dirty mask and keep running, hoping that a future
 > checkpoint will succeed.
 > This requires some pretty complicated tracking in the main process of what
 > has actually been committed by which process, but that's not too bad to
 > deal with.  It also means that we can continue running uninterrupted and,
 > once manual intervention has cleared the problem, checkpoint all data.
 > The problem is that it allows for an unbounded amount of computation to
 > happen without a backing store.
 > I would appreciate hearing what people think.


In writing my own persistent store I took the view that if there is any chance
of user intervention being viable, the store should provide a handling hook.
Returning from the handler is interpreted as "continue as normal" (the unbounded
computation case), and it is up to the handler whether to block or to carry on
in a separate thread, i.e. either a) display a message window and block further
processing by not returning immediately, or b) launch a thread to display the
message and return from the handler straight away.
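
A minimal sketch of such a hook in C. All names are hypothetical; this is
neither LPSM's API nor the code described above, just one way the "returning
means continue" convention could look:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error codes for an asynchronous checkpoint failure. */
typedef enum { CKPT_ERR_DISK_FULL, CKPT_ERR_IO } ckpt_error;

/*
 * Handler contract: returning means "continue as normal".
 * A handler that wants to stop further processing simply does not
 * return until the user has dealt with the problem.
 */
typedef void (*ckpt_fail_handler)(ckpt_error err, void *user_data);

static ckpt_fail_handler g_handler = NULL;
static void *g_handler_data = NULL;

/* Install a handler; passing NULL removes it. */
void ckpt_set_fail_handler(ckpt_fail_handler h, void *user_data)
{
    g_handler = h;
    g_handler_data = user_data;
}

/*
 * Called by the store once it notices that a checkpoint failed.
 * Returns 1 if a handler ran (and returned, so processing continues),
 * or 0 if none is installed and the caller must apply its default
 * policy, for example aborting.
 */
int ckpt_report_failure(ckpt_error err)
{
    if (g_handler == NULL)
        return 0;
    g_handler(err, g_handler_data);
    return 1;
}

/* Example handler for case b): record the failure and return at once. */
void count_failures(ckpt_error err, void *user_data)
{
    int *count = user_data;
    (void)err;
    assert(count != NULL);
    ++*count;
}
```

A blocking handler for case a) would have the same signature and would just
wait inside the function, for instance on user input, before returning.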

Perhaps I could solicit opinions on something?
I see that by using a file-based log, as LPSM does, it is trivial to detect the
end of the log. But what if you were to use a block device?
Then it is not so simple, because previously written log material can be wrongly
interpreted as part of a valid log record.
I use a double root block scheme, which reserves space at the head of the log
device. But I am not happy with it, since it entails a change of disk head
position after the regular log record write. The alternative, I guess, is to
include some sort of checksum with the last record written, so as to provide a
measure of protection against inadvertent collision with previous logging. But
what is a computationally inexpensive operation, and what degree of certainty is
enough?
Have you thought about this issue at all?
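
To make the checksum alternative concrete, here is a rough sketch in C. The
record layout, the function names and the use of a sequence number are all
assumptions for illustration, not anything LPSM does. Note that a checksum on
its own cannot reject a stale record from an earlier pass over the device,
since that record carries a checksum that was valid when it was written; the
sketch therefore folds a sequence number into the checksummed data and checks
it against the value the reader expects next. For data that is effectively
random, a 32-bit check passes by accident with probability of about 1 in 2^32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Standard reflected CRC-32 (polynomial 0xEDB88320), one bit at a time.
 * Pass 0 as crc for the first chunk and the previous result afterwards.
 */
uint32_t crc32_update(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    crc ^= 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical on-disk record header; fields are in host byte order. */
typedef struct {
    uint32_t seq;  /* increases by one with every record written */
    uint32_t len;  /* payload length in bytes */
    uint32_t crc;  /* CRC-32 over seq, len and the payload */
} log_header;

static uint32_t record_crc(uint32_t seq, const void *payload, uint32_t len)
{
    uint32_t crc = crc32_update(0, &seq, sizeof seq);
    crc = crc32_update(crc, &len, sizeof len);
    return crc32_update(crc, payload, len);
}

/* Fill in the header for a record that is about to be written. */
void log_record_seal(log_header *h, uint32_t seq,
                     const void *payload, uint32_t len)
{
    h->seq = seq;
    h->len = len;
    h->crc = record_crc(seq, payload, len);
}

/*
 * Decide whether the bytes at the current log position are the next
 * record. avail is the number of payload bytes that can safely be read,
 * so a garbage length field cannot cause an out-of-bounds read.
 */
int log_record_valid(const log_header *h, uint32_t expected_seq,
                     const void *payload, size_t avail)
{
    if (h->seq != expected_seq || h->len > avail)
        return 0;
    return h->crc == record_crc(h->seq, payload, h->len);
}
```

A reader scanning the device would stop at the first position where
log_record_valid() fails and treat that as the end of the log. The bitwise
loop is the slow form of CRC-32; a table-driven version computes the same
value with much less work per byte.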

Chris Q.