Analysis of censored data

Censored data occurs when you know that a measurement exceeds some threshold, but you don’t know by how much. (There is a less common kind of censored data where you know that a measurement falls below some threshold, but do not know by how much.) As an example, suppose you watch people as they try to solve a problem and record how long each person takes to solve it. You don’t want to spend more than 10 minutes waiting for a person to reach a solution, so if a person has not solved the problem in 10 minutes, you call a halt and record only that “time to solve” was greater than 10 minutes. If five people solve the problem and two don’t, the data from seven people might look like this:

Case    Time to solve (minutes)
1       6
2       2
3       9
4       >10
5       4
6       9
7       >10

One non-optimal method for dealing with cases 4 and 7 is to treat the censored values as missing. Another is to substitute an arbitrary number like 11 or 12 for the censored values (which are known only to be greater than 10). Treating cases 4 and 7 as missing biases the sample by excluding the poorest problem solvers. Substituting an arbitrary number is also undesirable: the results then depend on the number chosen, and the exact effect of the substitution is impossible to know.
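The effect of the two non-optimal methods on something as simple as the sample mean can be sketched in a few lines of Python. This is only an illustration using the seven cases from the table above; the function names are hypothetical and are not part of any software mentioned here.

```python
# Times to solve from the example; None marks a censored case (> 10 minutes).
times = [6, 2, 9, None, 4, 9, None]

def mean_dropping_censored(values):
    """Treat censored cases as missing and average the remaining cases."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)

def mean_with_substitution(values, fill):
    """Replace each censored case with the arbitrary number `fill`."""
    filled = [fill if v is None else v for v in values]
    return sum(filled) / len(filled)

print(mean_dropping_censored(times))       # slowest solvers excluded
print(mean_with_substitution(times, 11))   # arbitrary choice: 11
print(mean_with_substitution(times, 12))   # arbitrary choice: 12
```

Dropping cases 4 and 7 gives the lowest mean because the two slowest people are excluded, and the two substituted means differ from each other only because of the arbitrary number chosen.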

In the Bayesian approach to censored data, you can take advantage of all the information you have about cases 4 and 7 without making any assumption beyond normality.
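To make the idea concrete, here is a minimal sketch of how a censored case can contribute information under the normality assumption: an observed time contributes the normal density at that time, while a censored time contributes the probability that a normal variable exceeds 10. The flat prior, the grid ranges, and all names below are assumptions made for illustration; this is a simple grid approximation, not necessarily the algorithm the software uses.

```python
import math

# Data from the example: five observed times and two cases censored at 10 minutes.
observed = [6.0, 2.0, 9.0, 4.0, 9.0]   # cases 1, 2, 3, 5, 6
n_censored = 2                          # cases 4 and 7
threshold = 10.0

def log_likelihood(mu, sigma):
    """Log-likelihood of a normal(mu, sigma) model with right-censored cases."""
    ll = 0.0
    # Observed cases: log of the normal density at the recorded time.
    for x in observed:
        z = (x - mu) / sigma
        ll += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    # Censored cases: log of P(time > threshold), the normal upper tail.
    z_c = (threshold - mu) / sigma
    upper_tail = 0.5 * math.erfc(z_c / math.sqrt(2.0))
    ll += n_censored * math.log(upper_tail)
    return ll

# Assumed for illustration: a flat prior over this grid of (mu, sigma) values.
mu_grid = [i * 0.1 for i in range(201)]            # 0.0 to 20.0
sigma_grid = [0.5 + i * 0.1 for i in range(146)]   # 0.5 to 15.0

cells = [(mu, log_likelihood(mu, sigma)) for mu in mu_grid for sigma in sigma_grid]
max_ll = max(ll for _, ll in cells)

# Unnormalized posterior weights, shifted by max_ll to avoid overflow.
weights = [(mu, math.exp(ll - max_ll)) for mu, ll in cells]
total = sum(w for _, w in weights)
posterior_mean_mu = sum(mu * w for mu, w in weights) / total

print(round(posterior_mean_mu, 2))
```

With these data the grid estimate of the mean lies above 6, the average obtained when cases 4 and 7 are simply dropped, because the censored cases pull the estimate toward longer times without any specific value being invented for them.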

This video (16 minutes and 10 seconds) shows how to fit a regression model using censored data. It is based on Example 32 (English/Japanese) in the User's Guide.