In the years since my first foray into consumer sleep monitors with the Zeo tracker, I have worn dozens of consumer and medical sleep monitoring devices, at home and in the sleep lab, and my research group has written about our perspectives along the way (here, here, and here). Despite the strong marketing claims, the reality is that we (physicians, patients, anyone really) still do not know exactly what to make of the outputs. Consumer monitors are not meant for medical use and do not make diagnostic claims (nor are they regulated by the FDA), yet sleep health straddles the “medical” and “wellness” domains, and the boundary between them can seem blurry at times. I thought it worth sharing some of my self-testing experience as a “case” illustration of the challenges of basic interpretation when a person uses a sleep monitor to discover (non-medical) patterns for themselves (for a more detailed look at the challenges, and even the risks, see our recent review). I used five devices simultaneously to track my sleep for five nights, along with a simple daily diary. My plan was to ask three questions: How different would the sleep results be across the five devices? Would any of the devices show sleep correlations with my diary entries? Could I confirm any impressions from the first five nights with a repeated block of self-testing?
These are the five devices I used:
1) Sleep Cycle app (iPhone)
2) ResMed S+ (non-contact)
3) Jawbone Up3 (wristband)
4) Oura (ring)
5) Eight (mattress cover)
Figure 1 shows screenshots to allow visual comparison of each device’s output over the five nights. We can see immediately that the sleep stages do not line up well across the three devices that distinguish REM and NREM sleep (#2-4 in the list above), and some systematic differences are apparent: Oura reported by far the highest percentage of REM per night, and the Up3 the least. The Eight and Sleep Cycle report only light and deep sleep (and it is unclear how those categories map onto REM-NREM sub-stages). The Sleep Cycle score was not correlated with any other device’s measurements, perhaps not surprising for the least technically sophisticated device of the group. However, none of the other devices’ sleep scores correlated with one another either. Looking at total sleep time, only the Eight and the Oura correlated with each other. Percentages of deep sleep and REM sleep (for the three devices reporting them) were uncorrelated. Whatever the devices are using to generate their metrics, the outputs were so different from one another that it felt as if they were measuring different people.
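To make the between-device comparison concrete, here is a minimal sketch of how one might tabulate and correlate nightly total sleep time across the five devices. The numbers are invented placeholders, not my actual readings; the point is the procedure, and a reminder that with only five nights these correlation estimates are extremely noisy.

```python
# Hypothetical illustration: pairwise correlation of nightly total sleep
# time (minutes) across devices. Values below are invented for the sketch,
# NOT actual readings from my five nights.
import pandas as pd

tst = pd.DataFrame({
    "SleepCycle": [410, 455, 390, 430, 470],
    "S+":         [395, 440, 405, 415, 450],
    "Up3":        [380, 460, 370, 440, 430],
    "Oura":       [420, 450, 385, 425, 465],
    "Eight":      [415, 448, 392, 428, 460],
})

# Pairwise Pearson correlations; with n = 5 nights these estimates are
# extremely unstable, which is part of the point of this exercise.
print(tst.corr(method="pearson").round(2))
```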
We have discussed issues of validation of consumer sleep monitors in a recent review. In this experiment, how would I decide which device is “right” without the gold standard of polysomnography (PSG)? The short answer is: you can’t. But one could reasonably ask, does it matter which is closest to PSG staging if my only goal is to learn patterns about myself over time?
Let’s unpack that idea using my diaries, applying some basic statistics to see which diary variables were correlated with the features each device reported. I kept track of how much caffeine, alcohol, exercise, and stress I had each day. Only caffeine was significantly correlated with any sleep device measure: the number of awakenings detected by the Eight. Interestingly, stress and exercise in my diary entries were inversely related: it turned out I only exercised on low-stress days (or, since I only exercised on weekend days, and those were low-stress, was it actually a weekend effect?).
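For readers curious about the mechanics, here is a minimal sketch of this kind of diary-versus-device test, using a Pearson correlation from scipy. The caffeine and awakening values are invented for illustration; with n = 5 the p-value is fragile, and no multiple-comparison correction is applied.

```python
# Hypothetical sketch: testing whether a diary variable tracks a device
# metric. Values are invented; with n = 5 nights the result is fragile.
from scipy.stats import pearsonr

caffeine_mg = [0, 95, 190, 95, 285]   # daily caffeine from the diary (invented)
eight_wakes = [1, 2, 4, 2, 5]         # awakenings reported by the Eight (invented)

r, p = pearsonr(caffeine_mg, eight_wakes)
print(f"r = {r:.2f}, p = {p:.3f}")
```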
Since reproducibility is a cornerstone of good science, and since I had only recorded for five nights, I decided to repeat the testing a few weeks later with an additional block, this time of seven nights. The relations between diary entries and device outputs were different. With about the same day-to-day variation in caffeine use, caffeine was now correlated (negatively) only with REM percentage on the Oura ring. Exercise was not correlated with any metric (including stress level) this time around. Alcohol was now negatively correlated with the number of wakes recorded by the Oura ring (and with nothing else). Stress level was positively correlated with deep sleep percentage on the Up3 and with the number of awakenings on the S+. So as a self-tester, my conclusions about my own sleep seem to depend on which device I am wearing; that cannot be good for science or discovery. How to explain this? Were there other factors that changed which I did not record (perhaps did not even notice, such as my diet)? Did the two weeks actually differ in how the days and nights were connected? Or did each recording block simply have too few nights to support reliable assessments?
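One way to build intuition for why two short blocks can yield entirely different “findings” is a small simulation: if several diary factors are screened against many device metrics over only a handful of nights, pure noise will frequently produce at least one nominally significant correlation. The sketch below assumes 4 diary variables, 15 device metrics, and 5 nights; all of these numbers are arbitrary choices for illustration.

```python
# Sketch: how often pure noise yields at least one "significant" correlation
# when 4 diary variables are tested against 15 device metrics over only
# 5 nights. All data are random, so every hit is a false positive.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_nights, n_diary, n_metrics, n_sims = 5, 4, 15, 2000

hits = 0
for _ in range(n_sims):
    diary = rng.normal(size=(n_nights, n_diary))
    metrics = rng.normal(size=(n_nights, n_metrics))
    pvals = [pearsonr(diary[:, i], metrics[:, j])[1]
             for i in range(n_diary) for j in range(n_metrics)]
    hits += any(p < 0.05 for p in pvals)

print(f"Fraction of 'weeks' with >= 1 spurious finding: {hits / n_sims:.2f}")
```

With roughly 60 uncorrected tests per block, most simulated “weeks” turn up at least one spurious correlation, and of course a different one each time.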
We often hear about the “power” calculation of a clinical trial, and that small study samples cannot detect real differences. But it is also true that in small samples, false-positive results are more common. It certainly seems reasonable, for anyone tracking their sleep, to ask how many nights should be recorded to arrive at the “right” answer, or at least to minimize the chance of false findings (positive or negative). This of course assumes we know which device to pay attention to in the first place, given that they all seem to have their own, shall we say, perspective on what’s going on. Putting that question aside, the answer to “how long” will depend on several factors: how many diary factors I record; how many features of sleep I examine; how much night-to-night variation there is in my sleep (when holding diary-tracked factors constant, which may not be feasible!); how much variation occurs in the diary factors (and whether they are correlated with one another); what magnitude of effect is sought; and with what confidence. That is a lot of factors to consider! This process is the “power calculation” mentioned above, and it may not be straightforward to estimate for a given person (and it will surely differ from person to person). But without some estimate of these quantities, even if the device is accurate, drawing conclusions about your own patterns from self-testing will be challenging.
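As a rough illustration of what such a power calculation looks like for a single diary-factor/sleep-metric pair, here is a sketch using the standard Fisher z approximation for a correlation coefficient (two-sided alpha, one test, no correction for multiple comparisons). All parameter choices are assumptions for illustration only.

```python
# Rough power calculation: how many nights are needed to detect a true
# correlation of magnitude r between one diary factor and one sleep metric,
# using the Fisher z approximation. Purely illustrative assumptions.
from math import atanh, ceil
from scipy.stats import norm

def nights_needed(r, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = norm.ppf(power)            # power requirement
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

for r in (0.3, 0.5, 0.7):
    print(f"r = {r}: ~{nights_needed(r)} nights")
# Approximate outputs: r = 0.3 needs ~85 nights, r = 0.5 ~30, r = 0.7 ~14.
# Even a strong effect needs more nights than my 5- or 7-night blocks,
# and screening many factors at once only raises the bar further.
```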

Figure 1. Comparison of device output across five nights. From top to bottom: S+, Up3, Oura, Sleep Cycle. The Eight mattress is not shown (its output in the iPhone app could not be captured in a single screenshot). I slept alone on all testing nights. From other testing, I know I do not have sleep apnea or periodic limb movements, so the focus is on sleep architecture. The screenshot alignments are manually approximated with respect to time.
Contributed by: Dr Matt Bianchi