I have inherited a data set that includes 97 surveys. However, due to de-identification, it is not known whether some of these surveys were completed by the same person twice (e.g. returned to more than one class). Two of the main items on the survey are a pre- and post-mood rating on a 5-point Likert scale. I was hoping to do a paired samples t-test. However, because of this unknown, the data potentially violate the assumption of independence of observations. If I am transparent in the write-up about this potential limitation and the inflation of Type I error, would it still be appropriate for me to conduct a paired samples t-test? Otherwise, are there any other statistical tests I could use, or would bootstrapping help to overcome this?
It looks like you use the term "survey" here for a single observation, i.e., all items completed by a single person. Is that so? (Standard reading could be that you may have 97 full datasets, each with many persons, from 97 different surveys, but this isn't what you mean, or is it?) – Christian Hennig
3 Answers
In general, statistical model assumptions are formal idealisations that live in the world of mathematics, not in the real world, and model assumptions are never perfectly fulfilled in the real world. So we regularly apply statistical methods in situations in which the formal model assumptions are not fulfilled, and it doesn't make sense to demand that model assumptions must be fulfilled.
This doesn't mean, however, that model assumptions can be ignored. The relevant question is whether model assumptions are violated in ways that are likely to mislead the conclusions. To decide whether this may be the case, we need to understand the implications of potential violations of the model assumptions.
Let's say you have 97 observations (pre- and post-mood), but some of these come from the same person. If the same person always gives the same answers, this means (a) that your effective sample size (related to the actual information content in the sample) is lower than 97: if only 65 different persons took part, you only have 65 independent "units" of information, not 97 (assuming that at least different persons are independent, which in many situations cannot be taken for granted either); and (b) that these 65 persons are not treated in the optimal way, because they don't all have the same weight (a person with 3 "data points" in the sample has three times the weight of a person with only one). If persons who return do not give exactly the same answers as before, it's not quite like this, but you still only have 65 independent units of information; the recurring observations of duplicate persons tell you something about "within-person variation", which is apparently not of interest here.
This means that the test statistic will, under the null hypothesis, have more variation than is assumed in the computation of the test's p-value. Consequently, computed p-values will be too small. For example, if you compute a p-value of 0.03 under the independence assumption, this means that a 95% confidence interval for your true effect will just not include 0. But if we could precisely assess the variation of the test statistic with only 65 different participants, which is bigger than what the theory based on independence suggests, we would find that the confidence interval (whose width is directly related to the variation of the test statistic) might be quite a bit wider, include 0, and be associated with a p-value larger than 0.05.
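To make this inflation concrete, here is a small simulation sketch (not part of the answer above, and with made-up numbers): it assumes, purely hypothetically, 65 distinct respondents of whom 32 return the survey twice with identical answers, generates the paired differences under the null hypothesis, and checks how often a paired t-test that treats all 97 rows as independent rejects at the 5% level.

```python
import numpy as np
from scipy import stats

# Hypothetical scenario: 65 distinct respondents, 32 of whom appear twice,
# giving 97 rows that the paired t-test treats as independent.
rng = np.random.default_rng(1)
n_unique, n_dup = 65, 32
n_sim, alpha = 10_000, 0.05
rejections = 0

for _ in range(n_sim):
    # Pre/post difference per person, simulated under the null (true mean 0).
    d = rng.normal(loc=0.0, scale=1.0, size=n_unique)
    # Duplicate the first 32 people's rows exactly, as if they returned the survey twice.
    d_with_dups = np.concatenate([d, d[:n_dup]])            # 97 rows
    t_stat, p_val = stats.ttest_1samp(d_with_dups, 0.0)     # assumes 97 independent observations
    rejections += (p_val < alpha)

print(f"Empirical Type I error rate: {rejections / n_sim:.3f} (nominal {alpha})")
```

Under these made-up settings the empirical rejection rate typically lands somewhere around 0.12 rather than the nominal 0.05; with fewer duplicates, or with returning respondents who do not answer identically, the inflation would be smaller.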
Just to give you an impression of what could happen, you could manually re-compute your t-statistic and p-value from your 97 observations but use 65 for the sample size $n$ in the formulae instead of 97. This is not precise, as it ignores how exactly the standard error estimate came about and how it could be affected, among other things, but it should give you an idea of the ballpark in which this could land. (Don't use this in a publication, as technically it is not correct.)
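As a rough sketch of that back-of-the-envelope recalculation (with hypothetical stand-in ratings, since the real data aren't shown here, and 65 as a guessed number of distinct respondents):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the 97 recorded pre/post ratings on a 1-5 scale.
pre = rng.integers(1, 6, size=97)
post = np.clip(pre + rng.integers(-1, 3, size=97), 1, 5)

d = post - pre                              # paired differences
n_eff = 65                                  # guessed number of distinct respondents

# Usual paired t-test, treating all 97 rows as independent.
t_raw, p_raw = stats.ttest_rel(post, pre)

# Rough recalculation: same mean and SD of the differences, but with the
# effective sample size plugged into the t formula (not publication-grade).
t_adj = d.mean() / (d.std(ddof=1) / np.sqrt(n_eff))
p_adj = 2 * stats.t.sf(abs(t_adj), df=n_eff - 1)

print(f"n=97: t={t_raw:.2f}, p={p_raw:.4f}")
print(f"n=65: t={t_adj:.2f}, p={p_adj:.4f}")
```

The adjusted p-value will typically come out noticeably larger; again, this only gauges the ballpark and is not something to report as the analysis.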
Now, on top of this, your problem is that you actually don't know how many duplicate participants you have and who they are. Without assuming any prior knowledge about this, I'd say "hands off", as the result can really be grossly misleading.
If, however, you have good arguments from your knowledge of the survey that duplicate participants are very unlikely (say, there may be one or two, perhaps even none, and most likely no more than two), I'd think that you could do it, with caution: explain the situation exactly, note that the p-value is likely somewhat too low, and in particular explain why you believe the problem is small because chances are there are hardly any duplicates. If you observe a p-value of, say, $10^{-5}$, chances still are that something is going on. Nobody can give you a guarantee that this will be fine, but (a) you are transparent about it, and (b) you hopefully have good arguments that the impact of the problem is small.
I'm not very optimistic that any straightforward approach will give you something more reliable, as your state of information isn't helpful (you don't know how many duplicates there are or which observations they are). So methods like the bootstrap have nothing more to work with either. The only thing I can imagine that could be done is a Bayesian analysis in which you model the occurrence of duplicates explicitly and specify a prior distribution for how many there might be. This would also require a prior distribution for how the within-person variation may relate to the between-person variation, so it won't be a piece of cake to do, and I have no idea how strong or weak the information is that you could use to set up such a model.
I would be amazed if there were any tests that could overcome an unknown number of repeated respondents. You could try to identify some duplicate people by looking at patterns of answers, but that's very messy.
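For what it's worth, a minimal sketch of that pattern-matching idea, assuming the de-identified responses sit in a pandas DataFrame (the column names and values here are hypothetical):

```python
import pandas as pd

# A hypothetical de-identified survey table; the real column names will differ.
df = pd.DataFrame({
    "pre_mood":  [2, 4, 2, 3, 4, 2],
    "post_mood": [4, 5, 4, 3, 5, 4],
    "item_3":    [1, 3, 1, 2, 3, 1],
    "item_4":    [5, 2, 5, 4, 2, 5],
})

# Flag rows whose full response pattern appears more than once.
dup_mask = df.duplicated(keep=False)
print(df[dup_mask])
print(f"{dup_mask.sum()} rows share a response pattern with at least one other row")
```

Identical response patterns can of course come from different people, especially with only a handful of 5-point items, so this can at best flag candidates, not identify duplicates.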
However, if you are clear about the situation in your write up, I think you can do it. If your topic is highly controversial, or likely to get a lot of attention, I would be more leery of doing so, because findings get quoted and limitations don't.
You might have to start over. Sorry.
There is nothing to add to @ChristianHennig's thorough answer: if you can be sure that only a small proportion (2, 3, 4%?) of your almost 100 observations are "repeats", then disclose the problem, explain why you believe firmly that it affects only a small proportion, and run your test; you will not be far off, and if the p-value is far from your significance level, your results will stand.
But... you have another issue, which (from my pov) is as problematic, if not worse. And that is that you were "hoping to do a paired samples t-test" on data which comes from "a 5-point Likert scale". T-tests are appropriate for interval- or ratio-scale data (see Wikipedia). The "difference" between an answer of 2 and an answer of 3, and between an answer of 4 and an answer of 5, while both 1, are not the same psychometrically (it depends on the exact wording, the subject's interpretation of it, etc.). The math just does not work. To see some of the issues, you may want to read here, or here, or here (and many more).
Now, you will not be the first to do such a thing, nor the last (sadly). So if you must do it that way, pinch your nose and do it. But truly, if you are dealing with answers on a 5-point Likert scale, you should use statistical methods suitable for ordinal-scale data; e.g. how many subjects rated higher, how many rated lower, and how compatible is that with a 50/50 split? This is just a very basic one, and there are many other suitable methods, but a t-test is not one of them...
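As a rough illustration of that basic ordinal approach (a sign test on the direction of change), with made-up counts standing in for the actual data:

```python
from scipy import stats

# Hypothetical counts: of the 97 surveys, suppose 48 rated higher afterwards,
# 21 rated lower, and 28 gave the same rating (ties are dropped in a sign test).
n_up, n_down = 48, 21

# Sign test: are "up" and "down" compatible with a 50/50 split?
result = stats.binomtest(n_up, n=n_up + n_down, p=0.5)
print(result.pvalue)
```

A Wilcoxon signed-rank test is another option often suggested for paired ratings, though with 5-point data it produces many ties and has interpretive caveats of its own, and neither test removes the duplicate-respondent problem discussed above.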