Article: Clinical Prediction Models for Sleep Apnea: The Importance of Medical History over Symptoms
If I had hair, it would have stood up on the back of my neck when I first saw this video. Heroes and rock stars can inspire, motivate, and even console. The video is gripping because we might not think of heroes and rock stars in certain settings. This is perhaps never more true than in statistics, or more generally, in how we make sense of data in medical research. I admit to even being embarrassed in the past about this topic, which seems at best to evoke yawns, and at worst takes the blame for obfuscating arguments and twisting facts. But what if we had a hero of statistics – no superpowers, just a mortal who would be like Holden Caulfield, catching those who wander too close to the risky edge of data misinterpretation, stepping in to keep us safe but also aware of the edge.
My hero is the Bayesian Statistics Avenger. And every hero needs a sidekick. Or three. The Avenger is more Socrates than Hulk – less about saving us from an enemy, and more about protecting us from wandering without context. The three minions each see data from a different perspective. Dr Yes is the optimist, singing the praises of the research. You’d think Dr Yes was a publicist for the researchers. Dr No is the pessimist, with a keen eye for spotting weakness and vulnerability. You’d think Dr No was a contrarian gladiator hired by competitors. Dr Go is the moderate soothsayer who wants everyone to just get along, offering concrete advice on how to bridge the gaps between Dr Yes and Dr No.
You may find that you gravitate toward one of these minions, or that you have some of each of them within you. Hopefully glancing at all three meets the goal: context. We’ll start with some of my own work, to kick off this series of entries… enjoy!
Contributed by Dr Matt Bianchi.
The circadian rhythm dominates discussions of jet lag. This is no accident: what could be more obvious than the body’s internal clock as the focus of how we should adjust to external clock changes as we travel across time zones? Well, like everything in life, the story is deeper than it first appears. Luckily, in the case of jet lag, to get the most out of travel planning, we only need to peek just beyond the circadian rhythm, to another internal “clock”, also managed by the brain (surprise!), called the sleep homeostat.
Before we talk about the sleep homeostat, let’s briefly visit the circadian part. When we travel west across time zones, we “gain” time before bed, in a sense, as the destination clock time is earlier compared to the home (departure) zone. Many people find such travel easier to manage, since it seems easier to stay up later than usual for many of us. But when we travel east across time zones, that is harder for most people, since it seems harder to go to bed earlier than usual. As far as wake-up times, traveling west is awesome because we can sleep in (at least theoretically) compared to home – but traveling east is a bummer because we need to get up earlier than we are used to. This is all roughly true for typical travel of 2-8 time zones. The more the time zone change, the harder it is to adjust in general, and as we approach 12 hours of change, the direction of travel plays less of a direct role. Techniques to reduce jet lag that target the circadian system involve schedule shifts (moving your sleep schedule gradually over several days prior to travel to approach the destination zone), light exposure (to match ideal times or at least avoid counter-productive times of bright light), and sometimes also taking melatonin. Sites like www.jetlagrooster.com are useful for this kind of advice.
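To make the schedule-shift idea concrete, here is a minimal sketch of the kind of plan a tool like jetlagrooster produces: move bedtime about an hour per day toward the destination zone in the days before travel. The one-hour step, the example times, and the function name are illustrative assumptions on my part, not a clinical recommendation.

```python
# A toy pre-travel schedule-shift planner (illustrative only, not medical advice).
def shift_plan(home_bedtime_hr, tz_change_hr, eastward=True, max_shift_per_day=1):
    """Return (day, suggested bedtime on a 24h clock) pairs leading up to travel."""
    plan = []
    shifted = 0
    day = 1
    while shifted < abs(tz_change_hr):
        shifted += min(max_shift_per_day, abs(tz_change_hr) - shifted)
        # Eastward travel: shift bedtime earlier; westward: shift it later.
        bedtime = (home_bedtime_hr - shifted) % 24 if eastward else (home_bedtime_hr + shifted) % 24
        plan.append((day, bedtime))
        day += 1
    return plan

# Example: a 10pm sleeper flying 5 time zones east -> bedtimes 21:00, 20:00, ... 17:00.
print(shift_plan(home_bedtime_hr=22, tz_change_hr=5, eastward=True))
```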
Now, back to the homeostat. This clock starts ticking when you wake up, and keeps track of how long you’ve been awake. For a person who typically sleeps from, say, 10pm to 6am, when night rolls in, there are two clocks saying it’s time for sleep: the circadian clock that pays attention to the actual clock on the wall, and the homeostatic clock that says “you’ve been awake since 6am, so yes, you can sleep now”. Fun fact: when you drink coffee, it is thought that the caffeine tricks the homeostatic clock into thinking you have not been awake as long as you have. The homeostatic clock can also be “tricked” by naps. If you are a 10pm to 6am kind of person, and you happen to take a nap late in the day, say, from 7-9pm, the homeostatic clock senses the recent sleep, so it might be difficult to fall asleep again at 10pm.
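For readers who like to see the idea in code, here is a toy sketch of the homeostatic clock: sleep pressure builds while awake and drains during sleep, loosely in the spirit of the classic two-process model. The time constants and the specific numbers are arbitrary illustrations of my own, not physiological measurements.

```python
import math

def sleep_pressure(hours_awake, start_pressure=0.0, tau_rise=18.0):
    """Homeostatic pressure rises toward 1.0 the longer you stay awake."""
    return 1.0 - (1.0 - start_pressure) * math.exp(-hours_awake / tau_rise)

def pressure_after_sleep(pressure, hours_asleep, tau_fall=4.0):
    """Pressure decays toward 0 while you sleep (including naps)."""
    return pressure * math.exp(-hours_asleep / tau_fall)

# A 6am-to-10pm day: 16 hours awake builds substantial pressure by bedtime.
p_no_nap = sleep_pressure(16)
# Same day with a 7-9pm nap: awake 13 hours, nap 2 hours, then 1 more hour awake.
p_with_nap = sleep_pressure(1, start_pressure=pressure_after_sleep(sleep_pressure(13), 2))
print(round(p_no_nap, 2), round(p_with_nap, 2))  # the late nap leaves much less sleep "drive" at 10pm
```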
How does this relate to jet lag? Well, let’s say you are flying east, across 5 time zones, and you land at 9pm in the destination time zone. Say you are hoping to be in bed shortly thereafter, by 10pm. Your circadian system does not like this, because it is only 5pm in your home zone, which it is accustomed to. Let’s also say you happened to doze for 3 hours in the second half of the flight. Now, your homeostat is also not happy, since you slept so recently before trying to go to bed at 10. You have two clocks working against you! The point is, depending on the time of day you are landing in the destination time zone, you would be wise to plan your in-flight dozing to please your sleep homeostat. For evening destination landings, you’d be wise to limit your in-flight sleep, and if you do sleep, try to front-load it at the beginning of the flight, so you have been awake for a while upon landing and have built up some homeostatic drive to sleep. This thinking applies to morning arrivals as well: in that case, sleep as much as you can and as close to landing as you can, so you wake up with your homeostat on your side to maintain alertness during the day at the destination.
Contributed by: Dr Matt Bianchi.
It is remarkable to think of how long insomnia has been written about in the medical and general literature, with remedies of alcohol and poppy dating back nearly as far as the earliest known writing. Even more remarkable is that not a single study, ever, has documented health benefits of taking a drug to treat insomnia. Of course, symptom improvement is presumably the reason that some individuals take such medications. However, the literature on adverse effects of hypnotic drugs has been building for decades. Each generation of sleeping pills claims to be safe and effective, until we realize the “safe” part was not quite right, and each new generation of drugs is initially thought (and advertised as such) to be safer. Figure 1 is a historical timeline of insomnia drugs, with the most recent ~75 years referring mainly to the USA market. In retrospect, we may be tempted to think that the safety issues of prior-generation drugs were obvious – but the cycle seems to repeat. Even thalidomide, the drug that left legions of children with terrible deformities, was initially sold as safe enough even for use in pregnancy for sleep. What will the next generation of sleep physicians be writing about our “modern” sleeping pills, and their claims of improved safety?
Throughout these generations of drugs, the “effective” part has been left essentially up to improvement of symptoms. This seems simple enough – what more is needed than the patient’s experience of improvement? Of course symptom resolution is important, but it cannot be the only standard. If it were, then why would physicians counsel patients against using alcohol (or opium, for that matter) to assist with sleeplessness? Well, hopefully that answer also seems obvious: because those substances carry substantial potential risks. Alcohol and opiates can actually make sleep objectively worse, even if a person feels subjectively that they slept “better”. So, clearly, we need to consider any drug remedy for sleep as a risk-benefit balance. And surely some patients will accept a purely subjective sense of “better” sleep, even if objective sleep is unchanged (or even worse), and even if substantial side effect risk is incurred, because the symptom relief is so important to the patient.
What we must avoid is the sense that this is an “easy” question, and we must not overlook the need to navigate the risk-benefit balance for each patient individually. It is not at all easy to judge the objective health benefits of sleeping pills. In fact, all published long-term studies of sleeping pill use and medical health have shown only risk [1]. And while I have been critical of some of that literature [2], such a lopsided literature should give us pause, especially since I am certain that many patients may be taking comfort in the thought that their treatment is improving their health. Is it? Even the health risks of chronic insomnia itself (setting aside whether medications help) have been strikingly questioned by the only large study of insomnia that actually measured objective sleep [3]. In that study, only those with both insomnia symptoms and objective short sleep time carried medical and psychiatric risk over the 10-year time-frame. Insomnia without objective short sleep, or objective short sleep without insomnia, did not carry these risks. Physicians do not routinely test the objective sleep of patients with chronic insomnia, though increasing data suggest we should be doing more of this – even as insurance increasingly restricts laboratory polysomnography in favor of simplistic sleep apnea kits that don’t measure sleep at all. Clearly, we cannot consider insomnia a simple disorder, nor can we consider drug therapy a simple solution. Even the most up-to-date review of the literature describes the evidence as “weak”, and offers mainly consensus (opinion) advice [4]. One thing is certain: we have a lot of work to do in this field.
[1] Kripke (2016) Hypnotic drug risks of mortality, infection, depression, and cancer: but lack of benefit.
[2] Bianchi et al (2012) Hypnotics and mortality risk. J Clin Sleep Med. 8(4):351-2.
[3] Vgontzas et al (2013) Insomnia with Objective Short Sleep Duration: the Most Biologically Severe Phenotype of the Disorder.
[4] Sateia et al (2017) Clinical Practice Guideline for the Pharmacologic Treatment of Chronic Insomnia in Adults: An American Academy of Sleep Medicine Clinical Practice Guideline. J Clin Sleep Med. 13(2):307-349.
Contributed by: Matt Bianchi MD PhD
In the years since my first foray into consumer sleep monitors with the Zeo tracker, I have worn dozens of consumer and medical sleep monitoring devices, at home and in the sleep lab, and my research group has written about our perspectives along the way (here, here, and here). Despite the strong marketing claims, the reality is that we (physicians, patients, anyone really) still do not know exactly what to make of the outputs. Consumer monitors are not meant for medical use, and do not make diagnostic claims (and are not regulated by the FDA), yet sleep health straddles “medical” and “wellness” domains, and the separating boundary can seem blurry at times. I thought it worth sharing some of my self-testing experience as a “case” illustration of the challenges of basic interpretation when a person uses a sleep monitor to discover (non-medical) patterns for themselves (for a more detailed look at the challenges and even risks, see our recent review). I used five devices simultaneously to track my sleep for five nights, along with a simple daily diary. My simple plan was to ask three questions: How different would the sleep results be across the five devices? Would any of the devices show sleep correlations with my diary entries? Could I confirm any impressions from the first five nights with a repeated block of self-testing?
These are the five devices I used:
1) Sleep Cycle app (iPhone)
2) ResMed S+ (non-contact)
3) Jawbone Up3 (wristband)
4) Oura (ring)
5) Eight (mattress cover)
Figure 1 shows screen shots to allow visual comparison of each device’s output over five nights. We can see immediately that the sleep stages do not line up so well in the three devices that distinguish REM and NREM sleep (#2-4 in the list above), and some systematic differences are apparent, such as Oura reporting by far the highest percentage of REM per night, with Up3 reporting the least. The Eight and Sleep Cycle only report light and deep sleep (though it is unclear how this would translate into REM-NREM sub-stages). The Sleep Cycle score was not correlated with any other device measurement, perhaps not surprising as the least technically sophisticated of the group. However, none of the other device sleep-scores correlated with any others either. Looking at total sleep time, only the Eight and the Oura correlated with each other. Percentages of deep sleep and REM sleep (for the 3 devices reporting it) were uncorrelated. Whatever it is they are using to generate their metrics, the outputs were so different from one another that it felt as if the devices were measuring different people.
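As an aside for readers who want to run this kind of comparison themselves, the analysis is nothing exotic: pairwise correlations between the nightly numbers each device reports. Here is a minimal Python sketch; the values are placeholder numbers for illustration, not my actual five nights of data.

```python
from itertools import combinations
from scipy.stats import pearsonr

# Placeholder total sleep time (hours) per night, per device - not real readings.
total_sleep_hours = {
    "Sleep Cycle": [7.1, 6.4, 7.8, 6.9, 7.3],
    "S+":          [6.8, 6.9, 7.5, 6.2, 7.0],
    "Up3":         [7.4, 6.1, 7.2, 7.0, 6.6],
    "Oura":        [6.9, 6.6, 7.6, 6.4, 7.1],
    "Eight":       [7.0, 6.5, 7.7, 6.5, 7.2],
}

# Correlate every pair of devices on the same metric.
for a, b in combinations(total_sleep_hours, 2):
    r, p = pearsonr(total_sleep_hours[a], total_sleep_hours[b])
    print(f"{a} vs {b}: r={r:.2f}, p={p:.3f}")
```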
We have discussed issues of validation of consumer sleep monitors in a recent review. In this experiment, how would I go about thinking of which device is “right”, without the gold standard of polysomnography (PSG)? The short answer is, you can’t. But one could reasonably ask, does it matter which is closest to PSG staging, if my only goal is to learn patterns about myself over time?
Let’s unpack that idea – using my diaries and some basic statistics – to see which daily variables were correlated with the features reported by each device. I kept track of how much caffeine, alcohol, exercise, and stress I had each day. Only caffeine was significantly correlated with any sleep device measure: the number of awakenings detected by the Eight. Interestingly, stress and exercise in my diary entries were inversely related: it turned out I only exercised on low-stress days (or, since I only exercised on weekend days, and those were low-stress, was it actually a weekend effect?).
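The diary analysis follows the same pattern: line up each day’s diary entry against that night’s device output and compute a correlation. A minimal sketch is below, again with hypothetical stand-in numbers (with only five nights, a rank correlation is about as fancy as one should get).

```python
from scipy.stats import spearmanr

# Hypothetical stand-ins for one diary factor and one device metric (5 days/nights).
caffeine_mg = [100, 200, 0, 300, 150]   # daily diary entry
eight_wakes = [2, 4, 1, 5, 3]           # awakenings reported by the Eight that night

rho, p = spearmanr(caffeine_mg, eight_wakes)
print(f"caffeine vs Eight awakenings: rho={rho:.2f}, p={p:.3f}")
```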
Since reproducibility is the cornerstone of good science, and since I only recorded for five nights, I decided to repeat the testing a few weeks later for an additional block, this time of seven nights. The relation between diary entries and sleep device outputs was different: despite about the same day-to-day variation in caffeine use, this week caffeine was only correlated (negatively) with the REM percentage on the Oura ring. Exercise was not correlated with any metric (including stress level) this time around. Alcohol was now correlated (negatively) with the number of wakes recorded by the Oura ring (but showed no other correlations). Stress level was positively correlated with deep sleep percentage on the Up3 and the number of awakenings on the S+. As a self-tester, my conclusions seem to depend on which device I’m wearing and which week I recorded – that can’t be good for science or discovery… how to explain this? Were there other factors that changed which were not recorded (perhaps factors I didn’t even realize, such as my diet)? Was it because the weeks actually differed in how the days-and-nights were connected? Did my recording block have too few nights to make reliable assessments?
We often hear about the “power” calculation of a clinical trial, and that small study samples cannot detect real differences. But it is also true that for small samples, false-positive results are more common. It certainly seems reasonable, for any person tracking their sleep, to ask how many nights should be recorded to make sure we arrive at the “right” answer, or at least minimize the chance of false findings (positive or negative). This would of course assume we know which device to pay attention to in the first place, given that they all seem to have their own, shall we say, perspective on what’s going on. Putting that question aside, the answer to “how long” will depend on several factors: how many diary factors I will record, how many features of sleep I will examine, how much variation there is night to night in my sleep (when holding diary-tracked factors constant, which may not be feasible!), how much variation occurs in the diary factors (and whether they are correlated with one another), what magnitude of effect is sought, and with what confidence. That sounds like a lot of factors to consider – it is a lot! This process is the “power calculation” mentioned above, and it may not be straightforward to estimate for a given person (and surely it will differ from person to person). But without some estimation of these issues, even if the device is accurate, drawing conclusions about your own patterns in self-testing efforts will be challenging.
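To give a flavor of what even a simplified “power calculation” looks like here, the sketch below uses the standard Fisher z approximation to estimate how many nights would be needed to detect a correlation of a given size between one diary factor and one sleep metric. The target correlations, alpha, and power are arbitrary choices for illustration, and this ignores all of the complications listed above (multiple metrics, correlated diary factors, and so on).

```python
import math
from scipy.stats import norm

def nights_needed(r, alpha=0.05, power=0.80):
    """Approximate nights needed to detect correlation r (two-sided test, Fisher z)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

for r in (0.3, 0.5, 0.7):
    print(f"true correlation {r}: ~{nights_needed(r)} nights")
# Even a fairly strong correlation (0.5) needs roughly a month of nights;
# a modest one (0.3) needs closer to three months.
```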
Figure 1. Comparison of device output across 5 nights. From top to bottom: S+, Up3, Oura, Sleep Cycle. The Eight mattress is not shown (output in the iPhone app could not be captured on a single screen). I slept alone for all testing nights. From other testing, I know I do not have sleep apnea or periodic limb movements, so the focus is on the sleep architecture. The screen shot alignments are manually approximated with respect to time.
Contributed by: Dr Matt Bianchi
Disease screening is an important part of medicine, as it can help doctors and patients prioritize who may warrant further testing and/or treatment. Often a trade-off accompanies screening: ease of use and low cost (benefits) are balanced against the sub-optimal accuracy of the screen (risk). Beyond this balancing act, we cannot stop with the result in hand – the result also requires proper interpretation by the doctor. No test in medicine is perfect. One of the most important skills a physician can develop is an intuition for recognizing false-positive and false-negative results when testing for any disease. The intuition can be qualitatively illustrated in two examples. If I strongly suspect that a patient has a disease, but their screening test comes back negative, I might reasonably wonder whether the screen was a false-negative. Consider a physician evaluating a negative screening questionnaire for sleep apnea from a patient who is obese and has other medical problems often linked to sleep apnea (e.g. heart disease, diabetes, difficult-to-control hypertension, etc) – it would be a mistake to blindly trust the screening result in this case. Likewise, if a screen result comes back positive in a patient with very low risk of sleep apnea, we may reasonably wonder if the result is a false-positive.
How often does this really happen, and how can doctors improve their interpretation of screening tests, especially when the result is unexpected compared to their clinical suspicions? Let’s look at the “STOP-BANG” questionnaire, which has 8 items and is considered the best-validated tool for sleep apnea screening. The acronym stands for snoring, tired, observed apnea, pressure (hypertension), BMI >35, age >50, neck circumference >40 cm, gender (male). When the initial validation of this tool was published over 5 years ago, it was discussed as if the 8-item score indicated a patient’s risk. While this may seem reasonable on the surface, remember that we need that third ingredient of context: how likely was the patient to have sleep apnea in the first place, before we used the screen? Interested readers can see my editorial on how to interpret their initial validation results here. The need to combine prior information with current data is exactly where Bayes’ Theorem helps us: the sensitivity and specificity of the test are combined with the pre-test probability of disease, to yield a new probability of disease (the “post-test probability”). See our recent blog on this topic for more details.
Because of the ongoing interest in screening for OSA, several publications have emerged since then. Most recently, the original group published a new analysis, attempting to improve the screen. Fortunately, they do discuss the pre-test probability. Unfortunately, they continue to claim that the score indicates the patient’s risk, which creates a paradox. In their publication, they performed polysomnography to determine what proportion of patients was normal and what proportions were diagnosed with mild, moderate, and severe sleep apnea. It turns out that about one-third of the patients they tested had at least moderate sleep apnea. By using this value as the pre-test probability, and then updating the probability after a positive STOP-BANG test, the post-test probability of sleep apnea was over 50%. It seems reasonable to call this post-test probability a “high risk” for sleep apnea. But it was >50% because of all three ingredients – not just the test result, but also the pre-test probability. For those with a normal STOP-BANG score, they assign the label “low risk”. One could argue whether the post-test probability here, of 20%, of at least moderate sleep apnea should really be considered “low”, especially since it is far higher than reported in the general adult population. But that is not the paradox – that comes if we think about a positive result from the very same tool, but in a population where only 5% are expected to have sleep apnea. Remember, their claim is that a positive screen result means high risk. But in this kind of population, the positive screen result gives us a post-test probability lower than 20% – which the authors just said should be called “low risk”. This apparent paradox is understood, and avoided, by simply remembering Bayes’ Theorem: without knowing all three ingredients, we cannot interpret the test properly. Put another way, the risk is not simply given by the test result – it must be interpreted in context, and that context is the pre-test probability.
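For readers who want to see the three ingredients combined explicitly, here is a minimal Bayes’ Theorem sketch. The sensitivity and specificity values are placeholders chosen only for illustration (they are not the figures from the STOP-BANG publication); the point is that the identical positive result yields very different post-test probabilities at different pre-test probabilities.

```python
def post_test_probability(pre_test, sensitivity, specificity, positive=True):
    """Update the probability of disease after a screening test result."""
    if positive:
        likelihood_ratio = sensitivity / (1 - specificity)
    else:
        likelihood_ratio = (1 - sensitivity) / specificity
    prior_odds = pre_test / (1 - pre_test)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

sens, spec = 0.90, 0.50  # placeholder screening accuracy, for illustration only
print(post_test_probability(1/3, sens, spec))   # ~0.47 when a third of those tested have the disease
print(post_test_probability(0.05, sens, spec))  # ~0.09 when only 5% have the disease
```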
Contributed by: Matt Bianchi MD PhD
I want to sell you this nickel – it has special powers to detect sleep apnea. When you toss it in front of a patient, heads is a positive test for sleep apnea. Since we are in the age of evidence-based medicine, and I want to convince you this is the real deal, let me share the data with you. The experiment was to flip the coin for 180 people already diagnosed with sleep apnea (the known cases), and 20 healthy adults with no sleep disorders (the healthy controls). We found that the nickel was correct 90% of the time when it came up heads! In statistical parlance, this means the nickel test has a 90% positive predictive value – excellent!
Let’s walk through the data calculations, in case you remain skeptical of the magic nickel. The 2x2 box (Figure A) shows how a standard experiment is set up. The left column contains the true cases and the right column contains the healthy controls. These columns represent the true or gold standard status of each person. The top row contains those who had a positive coin toss (heads), and the bottom row contains those who had a negative coin toss (tails). The sensitivity is the proportion of positive tests from the pool of known cases, and the specificity is the proportion of negative tests from the pool of healthy controls. The positive predictive value is the proportion of true positives over all who tested positive, while the negative predictive value is the proportion of true negatives over all who tested negative.
My coin is so special that it performs magic even though it has an equal chance of showing heads or tails when you flip it. Figure (B) shows that the sensitivity and specificity are each 50%, which would be expected for an ordinary coin. The magic comes when we calculate the positive predictive value: this is defined as the true positives (upper left box) divided by all positives (sum of the upper left and the upper right boxes). We see this is 90 divided by 90+10 (i.e., 100), which equals 90%! The math clearly shows the coin has the magic touch!
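Here are the same 2x2 calculations written out as code, using the counts from the nickel experiment: 180 cases and 20 controls, with a fair coin calling half of each group “positive”.

```python
def two_by_two(tp, fp, fn, tn):
    """Standard screening-test metrics from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Fair coin applied to 180 cases and 20 controls: half of each group is "heads".
print(two_by_two(tp=90, fp=10, fn=90, tn=10))
# -> sensitivity 0.5, specificity 0.5, ppv 0.9, npv 0.1
```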
WAIT - Where is the trick? How can a fair coin, randomly showing up heads or tails, correctly predict a person’s disease status? Luckily, with a little help from Bayes’ Theorem we can see right through this smoke screen. In this experiment, 90% of the people tested had sleep apnea (180 of the 200 total). The coin flip was really just a random result. It came up heads for half of the people, as we would expect for a normal coin. If we looked at a random group from our experiment (say, half of them), we would expect 90% of them to have sleep apnea. The coin did just that – randomly “picked” half of the people as heads – and thus it didn’t tell us anything at all beyond what we already would have guessed. Likewise, when the coin came up tails, it was only “right” 10% of the time – because only 10% of the population did not have sleep apnea.
Bayes’ Theorem tells us that we need three ingredients to make sense of any test result: sensitivity, specificity, and pre-test probability. Tragedy awaits any who tread the fields of test interpretation without being armed with all three. The sensitivity is the proportion of true cases correctly identified by the test as having the disease – the coin has a sensitivity of 50%. The specificity is the proportion of healthy people the test correctly identifies as NOT having the disease – the coin has a specificity of 50%. The pre-test probability is the proportion of the population being tested who have the disease – in this experiment, it was 90%.
Now that is a lot of math – if only there were a quick rule of thumb so you can’t be fooled by people trying to sell magic coins… there is! I call it the “Rule of 100”. If any test has a sensitivity and specificity that add up to 100%, the test is performing at chance level – like a random coin. In case you are wondering whether this is something special about the 50%-50% coin example, Figure (C) shows the same calculations as (B), but now using a coin that is biased towards heads, coming up heads 90% of the time. In this case, sensitivity (90%) plus specificity (10%) adds up to 100%, and we see the positive predictive value is still 90% – just as in the first example, where the “magic” coin was actually an ordinary fair coin.
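The biased-coin example in Figure (C) works out the same way in code: sensitivity plus specificity sums to 100%, so the positive predictive value is simply restating the 90% pre-test probability.

```python
# Biased coin (heads 90% of the time) applied to the same 180 cases and 20 controls.
tp, fp, fn, tn = 162, 18, 18, 2
sensitivity = tp / (tp + fn)   # 0.90
specificity = tn / (tn + fp)   # 0.10 -> sensitivity + specificity = 100% (Rule of 100)
ppv = tp / (tp + fp)           # 0.90, identical to the 180/200 = 90% prevalence
print(sensitivity, specificity, ppv)
```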
Why would anyone make a test that was nothing more than a random number generator? It’s the tragedy of failing to consider all three ingredients, and unfortunately it happens more than we would like. Although there are many published examples, one recent article serves to highlight the problem: it reported screening results for a sleep apnea tool that violate the Rule of 100, yet the authors called the tool “accurate” (see my blog posting on this paper).
Contributor: Matt Bianchi MD PhD
On May 13, 2015, ResMed announced a safety alert ahead of publication of a large trial of their adaptive PAP system (“ASV”) for heart failure patients with central apnea. The primary endpoint that the trial was designed to answer showed no effect of ASV. The unexpected finding came in an exploratory subset analysis (not the main goal of the trial): higher cardiovascular mortality in those using ASV (annual risk of 10% versus 7.5%). On May 15, the American Academy of Sleep Medicine posted their initial response to this announcement. In June 2015, the annual SLEEP conference hosted a special session in which cogent criticisms and concerns were raised in a balanced discussion about the trial, but the risk announcements had already been made public. A recent editorial detailed many points of uncertainty about the trial. It remains unknown why the main outcome was negative, and why the subset analysis suggested increased cardiac risk with ASV use.