Opinions on error recovery?

Christopher Quinn cq@htec.demon.co.uk
Tue, 18 Dec 2001 18:44:26 +0000


> Hello people... I would really like people's opinion on what the behaviour
> of LPSM on checkpointing failure (disk full, for example) should be.
> Checkpointing failure is detected asynchronously, so returning an error
> the obvious way isn't possible.
> 
> Currently, LPSM sends SIGABRT to the master process; a bit crude, and not
> safe if another checkpoint process has since been started.
> 
> It seems there are only a small handful of options:
> 
> a) Send a signal to the main process; make sure interlocks are in place so
> that any further checkpoint processes are terminated without writing any
> commit records.
> 
> This minimizes the processing that can actually take place without
> anywhere to put the data.  It also makes sure that on restart we will roll
> back to the last consistent state we could actually store on the backup
> medium.
> 
> 
> b) Reconstruct the dirty mask and keep running, hoping that a future
> checkpoint will succeed.
> 
> This requires some pretty complicated tracking in the main process of what
> has actually been committed by which process, but that's not too bad to
> deal with.  It also means that we can continue running uninterrupted and
> after manual intervention to clear the problem checkpoint all data.
> 
> The problem is that it allows for an unbounded amount of computation to
> happen without a backing store.
> 
> 
> I would appreciate hearing what people think.
> 
> 


Hi,

In writing my own persistent store I took the view that if there was any
chance of user intervention being viable, then the store should provide a
handling hook. Returning from the hook is interpreted as "continue as normal"
(your unbounded-computation case), and it is up to the hook whether to block
or to launch a thread continuation, i.e. either a) display a window message
and block further processing by not returning immediately, or b) launch a
thread to display the message while returning from the handler.
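A minimal sketch of such a hook in C; the names here (lpsm_set_failure_handler and so on) are my own invention for illustration, not LPSM's actual API:

```c
#include <stdio.h>

/* Hypothetical hook API -- names are illustrative, not part of LPSM.
 * The store calls the registered handler when an asynchronous
 * checkpoint fails; returning CKPT_CONTINUE resumes normal operation
 * (the unbounded-computation case), and the handler is free to block
 * for as long as it likes before returning. */
typedef enum { CKPT_CONTINUE, CKPT_ABORT } ckpt_action;
typedef ckpt_action (*ckpt_failure_handler)(int err);

static ckpt_failure_handler handler;

void lpsm_set_failure_handler(ckpt_failure_handler h) { handler = h; }

/* Invoked by the store when a checkpoint process reports failure. */
ckpt_action ckpt_failed(int err)
{
    if (handler)
        return handler(err);  /* may block, may spawn a thread, may return */
    return CKPT_ABORT;        /* default: the current crude abort behaviour */
}

/* Example handler: warn the user and keep running. */
static ckpt_action warn_and_continue(int err)
{
    fprintf(stderr, "checkpoint failed (errno %d); continuing\n", err);
    return CKPT_CONTINUE;
}
```

The point is that the policy decision (abort, block, continue) moves out of the library and into application code, where the viability of user intervention is actually known.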

Perhaps I could solicit an opinion on something?
I see that with a file-based log, as LPSM uses, it is trivial to detect its end.
But what if you were to use a block device?
Then it is not so simple, because previously written log material may be
wrongly interpreted as part of a valid log record.
I use a double root block scheme which reserves space at the head of the log
device, but I am not happy with it, since it entails a change of disk head
position after each regular log record write. The alternative, I guess, is to
include some sort of checksum with the last record written, so as to provide
a measure of protection against inadvertent collision with previous logging.
But what is a computationally inexpensive operation, and what degree of
certainty is enough!?
Have you thought about this issue at all?
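For what it's worth, here is a sketch of the checksum idea; the record layout and function names are my own, and CRC-32 is just one cheap candidate. Combining the checksum with a monotonic sequence number means stale data must both pass the CRC (a roughly 2^-32 accident) and happen to continue the sequence before it is misread as live log:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of end-of-log detection on a raw block device: each record
 * carries a sequence number and a CRC-32 over header + payload, so
 * leftover bytes from an earlier pass over the device fail validation.
 * Illustrative layout, not LPSM's. */

static uint32_t crc32_update(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return crc;
}

struct log_hdr {
    uint64_t seq;   /* monotonically increasing record number */
    uint32_t len;   /* payload length in bytes */
    uint32_t sum;   /* CRC-32 of header (sum zeroed) + payload */
};

uint32_t rec_sum(struct log_hdr h, const void *payload)
{
    h.sum = 0;  /* checksum field excluded from its own coverage */
    uint32_t c = crc32_update(0xFFFFFFFFu, &h, sizeof h);
    c = crc32_update(c, payload, h.len);
    return ~c;
}

/* During replay: a record continues the log only if its checksum is
 * intact AND its sequence number follows the previous record's. */
int rec_valid(const struct log_hdr *h, const void *payload, uint64_t prev_seq)
{
    return h->sum == rec_sum(*h, payload) && h->seq == prev_seq + 1;
}
```

A bitwise CRC like this costs a handful of operations per byte; a table-driven version is faster still, and either is likely cheap next to the write itself.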

Cheers,
Chris Q.