MGH Sleep Center: Director's Blog

Reflections on drug therapy for insomnia

3/6/2017

 
     It is remarkable to think of how long insomnia has been written about in the medical and general literature, with remedies of alcohol and poppy dating back nearly as far as the earliest known writing.  Even more remarkable is that not a single study, ever, has documented health benefits of taking a drug to treat insomnia.  Of course, symptom improvement is presumably the reason that some individuals take such medications.  However, the literature on adverse effects of hypnotic drugs has been building for decades.  Each generation of sleeping pills is claimed to be safe and effective, until we realize the “safe” part was not quite right, and each new generation of drugs is in turn thought (and advertised) to be safer.  Figure 1 is a historical timeline of insomnia drugs, with the most recent ~75 years referring mainly to the USA market.  In retrospect, we may be tempted to think that the safety issues of prior-generation drugs were obvious – but the cycle seems to repeat.  Even thalidomide, the drug that left legions of children with terrible deformities, was initially sold as safe enough for use even in pregnancy as a sleep aid.  What will the next generation of sleep physicians write about our “modern” sleeping pills and their claims of improved safety?
     Throughout these generations of drugs, the “effective” part has been judged essentially by improvement of symptoms.  This seems simple enough – what more is needed than the patient’s experience of improvement?  Of course symptom resolution is important, but it cannot be the only standard.  If it were, then why would physicians counsel patients against using alcohol (or opium, for that matter) to assist with sleeplessness?  Hopefully that answer also seems obvious: because those substances carry substantial potential risks.  Alcohol and opiates can actually make sleep objectively worse, even if a person feels subjectively that they slept “better”.  So, clearly, we need to consider any drug remedy for sleep as a risk-benefit balance.  And surely some patients will accept a purely subjective sense of “better” sleep, even if objective sleep is unchanged (or even worse), and even if substantial side effect risk is incurred, because the symptom relief is so important to them.
     What we must avoid is the sense that this is an “easy” question, and we must not overlook the need to navigate the risk-benefit balance for each patient individually.  It is not at all easy to judge the objective health benefits of sleeping pills.  In fact, all published long-term studies of sleeping pill use and medical health have shown only risk [1].  And while I have been critical of some of that literature [2], such a lopsided literature should give us pause, especially since many patients may be taking comfort in the thought that their treatment is improving their health.  Is it?  Even the health risks of chronic insomnia itself (setting aside whether medications help) have been strikingly questioned by the only large study of insomnia that actually measured objective sleep [3].  In that study, only those with both insomnia symptoms and objectively short sleep time carried medical and psychiatric risk over the 10-year time frame.  Insomnia without objective short sleep, or objective short sleep without insomnia, did not carry these risks.  Physicians do not routinely test objective sleep in patients with chronic insomnia, though increasing data suggest we should be doing more of this – even as insurance increasingly restricts laboratory polysomnography in favor of simplistic sleep apnea kits that do not measure sleep at all.  Clearly, we cannot consider insomnia a simple disorder, nor can we consider drug therapy a simple solution.  Even the most up-to-date review of the literature describes the evidence as “weak”, and offers mainly consensus (opinion) advice [4].  One thing is certain: we have a lot of work to do in this field.
Figure: Historical timeline of insomnia drugs.
References: 

[1]  Kripke (2016) Hypnotic drug risks of mortality, infection, depression, and cancer: but lack of benefit.

[2] Bianchi et al (2012) Hypnotics and mortality risk. J Clin Sleep Med. 8(4):351-2. 
 
[3] Vgontzas et al (2013) Insomnia with Objective Short Sleep Duration: the Most Biologically Severe Phenotype of the Disorder. 
 
[4] Sateia et al (2017) Clinical Practice Guideline for the Pharmacologic Treatment of Chronic Insomnia in Adults: An American Academy of Sleep Medicine Clinical Practice Guideline.  J Clin Sleep Med. 13(2):307-349.  
​
Contributed by:  Matt Bianchi MD PhD

Dr Bianchi’s latest self-testing: five consumer sleep monitors

1/30/2017

 
     In the years since my first foray into consumer sleep monitors with the Zeo tracker, I have worn dozens of consumer and medical sleep monitoring devices, at home and in the sleep lab, and my research group has written about our perspectives along the way (here, here, and here).  Despite the strong marketing claims, the reality is that we (physicians, patients, anyone really) still do not know exactly what to make of the outputs.  Consumer monitors are not meant for medical use and do not make diagnostic claims (and are not regulated by the FDA), yet sleep health straddles the “medical” and “wellness” domains, and the boundary between them can seem blurry at times.  I thought it worth sharing some of my self-testing experience as a “case” illustration of the challenges of basic interpretation when a person uses a sleep monitor to discover (non-medical) patterns for themselves (for a more detailed look at the challenges, and even risks, see our recent review).  I used five devices simultaneously to track my sleep for five nights, along with a simple daily diary.  My simple plan was to ask three questions: How different would the sleep results be across the five devices?  Would any of the devices show correlations between sleep and my diary entries?  Could I confirm any impressions from the first five nights with a repeated block of self-testing?

These are the five devices I used: 
          1) Sleep Cycle app (iPhone)
          2) ResMed S+ (non-contact)
          3) Jawbone Up3 (wristband)
          4) Oura (ring)
          5) Eight (mattress cover) 

     Figure 1 shows screen shots to allow visual comparison of each device’s output over five nights.  We can see immediately that the sleep stages do not line up well across the three devices that distinguish REM and NREM sleep (#2-4 in the list above), and some systematic differences are apparent, such as Oura reporting by far the highest percentage of REM per night and the Up3 reporting the least.  The Eight and Sleep Cycle report only light and deep sleep (though it is unclear how these categories translate into REM-NREM stages).  The Sleep Cycle score was not correlated with any other device measurement – perhaps not surprising, as it is the least technically sophisticated of the group.  However, none of the other devices’ sleep scores correlated with one another either.  Looking at total sleep time, only the Eight and the Oura correlated with each other.  Percentages of deep sleep and REM sleep (for the three devices reporting them) were uncorrelated.  Whatever the devices are using to generate their metrics, the outputs were so different from one another that it felt as if they were measuring different people.
     We have discussed issues of validation of consumer sleep monitors in a recent review.  In this experiment, how would I go about deciding which device is “right”, without the gold standard of polysomnography (PSG)?  The short answer is: you can’t.  But one could reasonably ask, does it matter which is closest to PSG staging, if my only goal is to learn patterns about myself over time?
     Let’s unpack that idea using my diaries, applying some basic statistics to see which diary variables were correlated with the features reported by each device.  I kept track of how much caffeine, alcohol, exercise, and stress I had each day.  Only caffeine was significantly correlated with any sleep device measure: the number of awakenings detected by the Eight.  Interestingly, stress and exercise in my diary entries were inversely related: it turned out I only exercised on low-stress days (or, since I only exercised on the weekend days, which happened to be low-stress, was it actually a weekend effect?).
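     For readers who want to try this at home, here is a minimal sketch of the kind of correlation check described above.  The nightly values are made-up placeholders (five nights of caffeine servings and device-reported awakenings), chosen only to show the mechanics; with n = 5, any p-value should be taken with a large grain of salt.

    # Minimal sketch: correlate one diary variable with one device metric.
    # The numbers are hypothetical placeholders, not data from this post.
    from scipy.stats import pearsonr

    caffeine_servings = [1, 3, 2, 0, 4]        # diary entries for 5 nights
    awakenings_eight = [2, 5, 3, 1, 6]         # device-reported awakenings

    r, p = pearsonr(caffeine_servings, awakenings_eight)
    print(f"r = {r:.2f}, p = {p:.3f}  (n = 5: wide confidence, fragile p-value)")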
     Since reproducibility is the cornerstone of good science, and since I had only recorded for five nights, I decided to repeat the testing a few weeks later for an additional block, this time of seven nights.  The relation between diary entries and sleep device outputs was different.  For about the same daily variation in caffeine use, this time caffeine was correlated (negatively) only with the REM percentage on the Oura ring.  Exercise was not correlated with any metric (including stress level) this time around.  Alcohol was now correlated (negatively) with the number of wakes recorded by the Oura ring (but with nothing else).  Stress level was positively correlated with deep sleep percentage on the Up3 and with the number of awakenings on the S+.  As a self-tester, my conclusions seem to depend on which device I’m wearing and which week I’m testing – that can’t be good for science or discovery.  How to explain this?  Were there other factors that changed but were not recorded (perhaps something I didn’t even think to track, such as my diet)?  Was it because the weeks actually differed in how the days and nights were connected?  Did my recording blocks have too few nights to make reliable assessments?
     We often hear about the “power” calculation of a clinical trial, and that small study samples cannot detect real differences.  But it is also true that small samples, especially combined with many comparisons, are prone to false-positive findings.  It certainly seems reasonable, for any person tracking their sleep, to ask how many nights should be recorded to arrive at the “right” answer, or at least to minimize the chance of false findings (positive or negative).  This of course assumes we know which device to pay attention to in the first place, given that they all seem to have their own, shall we say, perspective on what’s going on.  Putting that question aside, the answer to “how long” will depend on several factors: how many diary factors I record, how many features of sleep I examine, how much night-to-night variation there is in my sleep (holding diary-tracked factors constant, which may not be feasible!), how much variation occurs in the diary factors (and whether they are correlated with one another), what magnitude of effect is sought, and with what confidence.  That sounds like a lot of factors to consider – it is a lot!  This is the “power calculation” mentioned above, and it may not be straightforward to estimate for a given person (and surely it will differ from person to person).  But without some estimate of these issues, even if the device is accurate, drawing conclusions about your own patterns from self-testing will be challenging.
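     As a rough illustration of the false-positive problem (not an analysis of my actual data), the simulation below assumes five nights, four diary factors, and ten device metrics with no true relationships at all, and asks how often at least one “significant” correlation appears anyway.  The counts of factors and metrics are assumptions chosen only to mimic the scale of this kind of self-testing.

    # Simulation sketch: with small n and many comparisons, "significant"
    # correlations appear by chance alone.  All quantities are assumptions.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    n_nights, n_diary, n_metrics, n_blocks = 5, 4, 10, 5000

    blocks_with_false_hit = 0
    for _ in range(n_blocks):
        diary = rng.normal(size=(n_nights, n_diary))      # no true effect anywhere
        device = rng.normal(size=(n_nights, n_metrics))
        pvals = [pearsonr(diary[:, i], device[:, j])[1]
                 for i in range(n_diary) for j in range(n_metrics)]
        if min(pvals) < 0.05:
            blocks_with_false_hit += 1

    print(f"Blocks with at least one spurious 'finding': "
          f"{blocks_with_false_hit / n_blocks:.0%}")

     With 40 comparisons per block, most simulated blocks produce at least one spurious “finding” – one plausible explanation for why two honest weeks of tracking can tell two different stories.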

Figure 1.  Comparison of device output across 5 nights.  From top to bottom: S+, Up3, Oura, Sleep Cycle.  The Eight mattress is not shown (the output in its iPhone app could not be captured in a single screen shot).  I slept alone on all testing nights.  From other testing, I know I do not have sleep apnea or periodic limb movements, so the focus is on the sleep architecture.  The screen shot alignments are manually approximated with respect to time.
​

​Contributed by:  Dr Matt Bianchi

More paradoxes in published sleep apnea screening – Bayes to the rescue!

12/15/2016

 
     Disease screening is an important part of medicine, as it can help doctors and patients prioritize who may warrant further testing and/or treatment.  Often a trade-off accompanies screening: ease of use and low cost (benefits) are balanced against the sub-optimal accuracy of the screen (risk).  Beyond this balancing act, we cannot stop with the result in hand – the result also requires proper interpretation by the doctor.  No test in medicine is perfect.  One of the most important skills a physician can develop is an intuition for recognizing false-positive and false-negative results when testing for any disease.  This intuition can be qualitatively illustrated with two examples.  If I strongly suspect that a patient has a disease, but their screening test comes back negative, I might reasonably wonder whether the result was a false negative.  Consider a physician evaluating a negative screening questionnaire for sleep apnea from a patient who is obese and has other medical problems often linked to sleep apnea (e.g. heart disease, diabetes, difficult-to-control hypertension) – it would be a mistake to blindly trust the screening result in this case.  Likewise, if a screen comes back positive in a patient with very low risk of sleep apnea, we may reasonably wonder if the result is a false positive.
     How often does this really happen, and how can doctors improve their interpretation of screening tests, especially when the result is unexpected compared to their clinical suspicion?  Let’s look at the “STOP-BANG” questionnaire, which has 8 items and is considered the best-validated tool for sleep apnea screening.  The acronym stands for snoring, tired, observed apnea, pressure (hypertension), BMI >35, age >50, neck circumference >40 cm, and gender (male).  When the initial validation of this tool was published over 5 years ago, it was discussed as if the 8-item score indicated a patient’s risk.  While this may seem reasonable on the surface, remember that we need that third ingredient of context: how likely was the patient to have sleep apnea in the first place, before we used the screen?  Interested readers can see my editorial on how to interpret their initial validation results here.  The need to combine prior information with current data is exactly where Bayes’ Theorem helps us: the sensitivity and specificity of the test are combined with the pre-test probability of disease to yield a new probability of disease (the “post-test probability”).  See our recent blog on this topic for more details.
     Because of the ongoing interest in screening for OSA, several publications have emerged since then.  Most recently, the original group published a new analysis attempting to improve the screen.  Fortunately, they do discuss the pre-test probability.  Unfortunately, they continue to claim that the score itself indicates the patient’s risk, which creates a paradox.  In their publication, they performed polysomnography, so they knew what proportion of the tested group was normal and what proportions had mild, moderate, and severe sleep apnea.  It turns out that about one-third of the patients they tested had at least moderate sleep apnea.  Using this value as the pre-test probability, and then updating the probability after a positive STOP-BANG result, the post-test probability of sleep apnea was over 50%.  It seems reasonable to call this post-test probability a “high risk” for sleep apnea.  But it was >50% because of all three ingredients – not just the test result, but also the pre-test probability.  For those with a normal STOP-BANG score, they assign the label “low risk”.  One could argue whether the post-test probability here, 20%, of at least moderate sleep apnea should really be considered “low”, especially since it is far higher than reported in the general adult population.  But that is not the paradox – that comes if we think about a positive result from the very same tool in a population where only 5% are expected to have sleep apnea.  Remember, their claim is that a positive screen result means high risk.  But in this kind of population, the positive screen result gives us a post-test probability lower than 20% – which the authors just said should be called “low risk”.  This apparent paradox is understood, and avoided, by simply remembering Bayes’ Theorem: without knowing all three ingredients, we cannot interpret the test properly.  Put another way, the risk is not simply given by the test result – it must be interpreted in context, and that context is the pre-test probability.
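     For the arithmetic-minded reader, here is a minimal sketch of the Bayes update described above.  The sensitivity and specificity values are placeholders chosen only for illustration (they are not the published STOP-BANG figures); the point is that the very same positive result yields a very different post-test probability when the pre-test probability changes from roughly one-third (a sleep-clinic-like population) to 5% (a low-risk population).

    # Bayes' theorem for a screening test.  Sensitivity and specificity below
    # are illustrative placeholders, not the published STOP-BANG values.
    def post_test_probability(sens, spec, pretest, positive=True):
        if positive:
            true_pos = sens * pretest
            false_pos = (1 - spec) * (1 - pretest)
            return true_pos / (true_pos + false_pos)
        false_neg = (1 - sens) * pretest
        true_neg = spec * (1 - pretest)
        return false_neg / (false_neg + true_neg)

    sens, spec = 0.90, 0.45                      # assumed for illustration
    for pretest in (0.33, 0.05):                 # clinic-like vs low-risk population
        pos = post_test_probability(sens, spec, pretest)
        neg = post_test_probability(sens, spec, pretest, positive=False)
        print(f"pre-test {pretest:.0%}: positive -> {pos:.0%}, negative -> {neg:.0%}")

     Run with these assumed values, a positive screen corresponds to roughly 45% post-test probability in the first population but under 10% in the second – the same “positive” result, two very different risks.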
 
Contributed by:  Matt Bianchi MD PhD
 

The Tale of the Magic Coin: A Bayesian Tragedy

11/9/2016

 
     I want to sell you this nickel – it has special powers to detect sleep apnea.  When you toss it in front of a patient, heads is a positive test for sleep apnea.  Since we are in the age of evidence-based medicine, I want to convince you this is the real deal, so let me share the data with you.  The experiment was to flip the coin for 180 people already diagnosed with sleep apnea (the true disease-positive group) and 20 healthy adults with no sleep disorders (the true disease-negative group).  We found that the nickel was correct 90% of the time when it came up heads!  In statistical parlance, this means the nickel test has a 90% positive predictive value – excellent!
     Let’s walk through the data calculations, in case you remain skeptical of the magic nickel.  The 2x2 box (Figure A) shows how a standard experiment is set up.  The left column contains the true cases and the right column contains the healthy controls.  These columns represent the true or gold standard status of each person.  The top row contains those who had a positive coin toss (heads), and the bottom row contains those who had a negative coin toss (tails). The sensitivity is the proportion of positive tests from the pool of known cases, and the specificity is the proportion of negative tests from the pool of healthy controls.  The positive predictive value is the proportion of true positives over all who tested positive, while the negative predictive value is the proportion of true negatives over all who tested negative.     
     My coin is so special that it performs its magic even though it has an equal chance of showing heads or tails when you flip it.  Figure (B) shows that the sensitivity and specificity are each 50%, which is what we would expect for an ordinary coin.  The magic comes when we calculate the positive predictive value: this is defined as the true positives (upper left box) divided by all positives (the sum of the upper left and upper right boxes).  We see this is 90 divided by 90+10 (i.e., 100), which equals 90%!  The math clearly shows the coin has the magic touch!
     WAIT - Where is the trick?  How can a fair coin, randomly showing up heads or tails, correctly predict a person’s disease status?  Luckily, with a little help from Bayes’ Theorem we can see right through this smoke screen.  In this experiment, 90% of the people tested had sleep apnea (180 of the 200 total).  The coin flip was really just a random result.  It came up heads for half of the people, as we would expect for a normal coin.  If we looked at a random group from our experiment (say, half of them), we would expect 90% of them to have sleep apnea.  The coin did just that – randomly “picked” half of the people as heads – and thus it didn’t tell us anything at all beyond what we already would have guessed.  Likewise, when the coin came up tails, it was only “right” 10% of the time – because only 10% of the population did not have sleep apnea.   
     Bayes’ Theorem tells us that we need three ingredients to make sense of any test result: sensitivity, specificity, and pre-test probability.  Tragedy awaits any who tread the fields of test interpretation without being armed with all three.  The sensitivity is the proportion of true cases correctly identified by the test as having the disease – the coin has a sensitivity of 50%.  The specificity is the proportion of healthy people the test correctly identifies as NOT having the disease – the coin has a specificity of 50%.  The pre-test probability is the proportion of the population being tested who have the disease – in this experiment, it was 90%.
     Now that is a lot of math – if only there were a quick rule of thumb so you can’t be fooled by people trying to sell magic coins… there is!  I call it the “Rule of 100”.  If any test has a sensitivity and specificity that add up to 100%, the test is performing at chance level – like a random coin.  In case you are wondering whether this is something special about the 50%-50% coin example, Figure (C) shows the same calculations as (B) but now using a coin that is biased towards heads, coming up heads 90% of the time.  In this case, sensitivity (90%) plus specificity (10%) again add up to 100, and we see the positive predictive value is still 90% – just as with the fair coin in the first example.
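     The 2x2 arithmetic is easy to verify yourself.  The sketch below uses the numbers from the example above (180 cases, 20 controls) and computes the same quantities for the fair coin and the heads-biased coin; in both cases sensitivity and specificity sum to 100% and the “impressive” 90% positive predictive value simply mirrors the 90% prevalence.

    # 2x2 table arithmetic for the "magic coin" example (180 cases, 20 controls).
    def coin_test(cases, controls, p_heads):
        tp = cases * p_heads            # heads among true cases
        fn = cases - tp                 # tails among true cases
        fp = controls * p_heads         # heads among healthy controls
        tn = controls - fp              # tails among healthy controls
        sens, spec = tp / cases, tn / controls
        ppv, npv = tp / (tp + fp), tn / (tn + fn)
        return sens, spec, ppv, npv

    for label, p_heads in (("fair coin", 0.5), ("heads-biased coin", 0.9)):
        sens, spec, ppv, npv = coin_test(180, 20, p_heads)
        print(f"{label}: sens {sens:.0%} + spec {spec:.0%} = "
              f"{100 * (sens + spec):.0f} (chance level); PPV {ppv:.0%}, NPV {npv:.0%}")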
     Why would anyone make a test that was nothing more than a random number generator?  It’s the tragedy of failing to consider all three ingredients, and it happens more often than we would like.  Although there are many published examples, one recent article serves to highlight the problem: it reported screening test results for a sleep apnea tool that violate the Rule of 100, yet the tool was called “accurate” by the authors (see my blog posting on this paper).
               

Contributor:  Matt Bianchi MD PhD

A reflection on risk:  the ASV safety alert 1 year later

11/2/2016

 

     On May 13, 2015, ResMed announced a safety alert ahead of publication of a large trial of their adaptive PAP system (“ASV”) for heart failure patients with central apnea [1].  The primary endpoint, which the trial was designed to assess, showed no effect of ASV.  The unexpected finding came in an exploratory subset analysis (not the main goal of the trial): higher cardiovascular mortality in those using ASV (annual risk of 10% versus 7.5%).  On May 15, the American Academy of Sleep Medicine posted its initial response to the announcement.  In June 2015, the annual SLEEP conference hosted a special session in which cogent criticisms and concerns were raised in a balanced discussion of the trial, but the risk announcements had already been made public.  A recent editorial detailed many points of uncertainty about the trial [2].  It remains unknown why the main outcome was negative, and why the subset analysis suggested increased cardiac risk with ASV use.
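     To put the quoted subset figures in absolute terms, a rough back-of-the-envelope calculation (using only the two annual risks above, and ignoring confidence intervals and follow-up details) looks like this:

    # Rough arithmetic on the quoted annual cardiovascular mortality figures;
    # illustrative only, not a re-analysis of the trial.
    asv_risk, control_risk = 0.10, 0.075
    absolute_increase = asv_risk - control_risk      # 2.5 percentage points per year
    nnh = 1 / absolute_increase                      # ~40 patient-years per excess death
    relative_increase = asv_risk / control_risk - 1  # ~33% relative increase
    print(f"Absolute increase: {absolute_increase:.1%}/year; "
          f"NNH ~ {nnh:.0f}; relative increase ~ {relative_increase:.0%}")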
     As with the controversial SAVE trial results recently published (see blog entry of September 22, 2016), the details of a study influence how confidently one can interpret the findings.  This should come as no surprise:  extensive effort and resources go into trial design to ensure the data is the highest quality possible.  The MGH Sleep Division reviewed the ASV trial, called the “SERVE-HF” study, and admittedly our own physicians held diverse opinions about the results, and each brings their own calculus of risk tolerance.  Patients also are entitled to their own risk calculus – ideally formulated together with their treating physicians for any healthcare decision.
     The challenges of the SERVE-HF trial can be summarized in two major categories: experimental design, and therapy effectiveness.  In this trial, wearing ASV for sleep apnea versus no treatment for sleep apnea was randomly assigned.  The intention of randomization is to make sure that other factors that could impact the results are evenly distributed (by chance) between groups.  The problem is, it did not fully work in SERVE-HF: by chance, the group assigned to ASV had one crucial significant difference, a 42% higher rate of anti-arrhythmic medication use.  Why is this important?  Because being on such a drug was associated with greater mortality risk on all three major outcomes (combined endpoint, all-cause mortality, and cardiovascular mortality).  The trial did not report smoking status or sleeping pill use, both of which are independently associated with mortality in other studies – might these also have been unevenly distributed (by chance) between groups?  Might this have contributed to the excess cardiac mortality in the ASV group?
     Although the investigators made efforts to ensure the machines were working properly, many participants using ASV had significant levels of ongoing sleep apnea according to objective measures, including oximetry.  A machine-reported event index of <10 per hour was taken to indicate adequate therapy – however, recent data suggest that the proprietary algorithms used are too lenient, meaning that breathing problems are often worse than the machine reports [3].  In addition, compliance with ASV was modest, only about 3.5 hours per night.  Other details of the study include an apparent benefit of ASV in patients with <20% Cheyne-Stokes pattern, and a non-significant trend toward worse outcomes in those with LVEF <30%.
     These and other factors complicate interpretation of the data.  What about the possibility that ASV is actually harmful to certain patients?  Some ideas include that high pressures and over-ventilation negatively impact cardiac function, and the patients in this trial were vulnerable due to their substantial heart failure.  Years before this trial, my colleague at BIDMC, Dr Robert Thomas, described several potential concerns about the “complexity” of complex apnea and the reliance on machine algorithms [4]. 
     As the field struggles with these concerns and awaits future data to help navigate the uncertainty, providers and patients must work together to balance risk-benefit trade-offs.  This may be easier said than done when risk tolerance may differ among regulators, physicians, and patients.  

References:
[1] Cowie et al (2015) Adaptive Servo-Ventilation for Central Sleep Apnea in Systolic Heart Failure.
N Engl J Med. 373(12):1095-105.  Full text here:  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4779593/
 
[2] Javaheri et al (2016) SERVE-HF: More Questions Than Answers.  Chest. 2016 Apr;149(4):900-4.
 
[3] Reiter et al (2016) Residual Events during Use of CPAP: Prevalence, Predictors, and Detection Accuracy.  J Clin Sleep Med. 12(8):1153-8.
 
[4] Thomas (2011) The chemoreflex and sleep-disordered breathing: man and machine vs. the beast.
Sleep Med. 12(6):533-5.

 
Contributor:  Matt Bianchi MD PhD

Screening for sleep apnea:  it is 100% certain that you need Bayes’ Theorem

10/27/2016


 
     Health screening is a common sense approach applied to many medical conditions.  It should be easy:  create a sensible tool and then validate it.  In the medical literature, this usually involves calculating performance measures such as the sensitivity, specificity, and predictive value of the tool.  Calculating each of these is simple -- the risk resides in the interpretation of these numbers [1]. 

     There are two rules of thumb that will protect anyone from mis-interpreting screening tool statistics.  The first I call the “Rule of 100”: if the sensitivity and specificity add up to 100, the tool is performing at chance level.  The higher the sum (toward the maximum of 200), the more accurate the tool.  If the sum is less than 100, the performance is paradoxical (a positive result means the disease is less likely).  In statistical parlance, when the sum of sensitivity and specificity is 100, the likelihood ratio equals 1, which means the pre-test probability equals the post-test probability – in other words, the test result added no information.
   
     The second rule is that you cannot interpret a screening test about disease probability without a key ingredient:  the pre-test probability.  This completes the Bayesian triad of ingredients, along with sensitivity and specificity, needed to interpret any screening test.  The information gained by the screen can be viewed as the difference between the pre-test probability and the post-test probability (also called the “predictive value”).   The predictive value can appear high because the pre-test probability was high, or because the test is very accurate.  Even a random coin toss, where heads is positive for the disease, will have a “good positive predictive value” if applied to a population with a high probability of the disease.
   
     Unfortunately, many publications miss one or both of these points.  We could choose many examples, but today a new publication will serve to demonstrate the problem [2].  The authors report one version of their screening tool for sleep apnea as having 24% sensitivity and 68% specificity.  This violates the “Rule of 100”, and in fact shows that the test is paradoxical.  The supposedly improved versions of their screen showed sums of 100-103.  The Rule of 100 tells us that the screening tool added virtually nothing, because it is performing near chance level.

     Bayes’ Theorem is all too often missing in medical testing.  There are so many examples that one cannot publish letters to the editor for each (though I did recently, for a particularly problematic example regarding the contentious topic of home sleep testing [3]).  Yet there is hope: armed with just the Rule of 100, anyone can quickly glance at test performance figures and surmise the accuracy.  And for tools that pass this Rule, the second rule will guide interpretation of the predictive value.
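     As a worked example of both rules, here is a minimal sketch using the 24%/68% figures quoted above; the 33% pre-test probability is an assumed placeholder, included only to show the direction of the update.

    # Likelihood ratios behind the Rule of 100 (24%/68% are the figures quoted
    # above; the 33% pre-test probability is an assumed placeholder).
    def likelihood_ratios(sens, spec):
        return sens / (1 - spec), (1 - sens) / spec   # LR+, LR-

    def post_test(pretest, lr):
        odds = pretest / (1 - pretest) * lr           # Bayes' theorem in odds form
        return odds / (1 + odds)

    sens, spec = 0.24, 0.68
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    print(f"sens + spec = {100 * (sens + spec):.0f} (<100); "
          f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
    print(f"pre-test 33% -> post-test after a POSITIVE result: {post_test(0.33, lr_pos):.0%}")

     Because LR+ is below 1, a positive result actually lowers the probability of disease (from 33% to about 27% here) – exactly the paradox the Rule of 100 flags at a glance.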
 
References:
[1] Bianchi MT, Alexander BM (2006) Evidence based diagnosis: does the language reflect the theory?
BMJ. 333(7565):442-5.   (full text at:  https://www.ncbi.nlm.nih.gov/pubmed/16931846)
 
[2] Laratta et al (2016) Validity of administrative data for identification of obstructive sleep apnea.
J Sleep Res. doi: 10.1111/jsr.12465.
 
[3] Bianchi (2015) Evidence that home apnea testing does not follow AASM practice guidelines--or Bayes' theorem. J Clin Sleep Med. 15;11(2):189.  (full text at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4298779/)

 
Contributor:  Matt Bianchi MD PhD

Comment on “CPAP for Prevention of Cardiovascular Events in Obstructive Sleep Apnea”

9/22/2016

 
  We applaud the authors of the SAVE trial for orchestrating such a large prospective trial in this important population (1).  Such high-profile negative findings represent an opportunity to recognize and address key points of uncertainty in the field of sleep apnea, from diagnosis to management.  Several factors likely contributed to “noise” in this trial that reduced its power to detect physiological benefits from CPAP therapy.  From a diagnostic standpoint, the device was a 2-channel “Level 4” at-home kit, using oximetry to quantify OSA and the flow channel to exclude those with >50% periodic breathing – neither of which are standard methods, even compared with “Level 3” kits, which at least include effort belts.  At-home kits do not provide information about periodic limb movements, which have been linked to vascular morbidity (2).  At-home kits are also not validated to detect central apnea, for which this population may be at increased risk (and no data on opiate use, another risk factor for central apnea, were provided).  Further, the use of at-home kits for diagnosis was not in line with American Academy of Sleep Medicine standards, which require a pre-test probability of 80% for AHI >15 (3,4) – a threshold not met even in this high-risk population.  There are no data regarding insomnia, a common comorbidity of sleep apnea, or the use of hypnotics, which carry morbidity risk.  Even though these diagnostic points of uncertainty may have been distributed evenly by randomization, they nevertheless contribute unaccounted-for variance that could reduce power.

   Uncertainties in the treatment phase are likely to have played an even more important role.  Using auto-PAP to choose a fixed pressure is clinically commonplace, under the dual assumptions that a) CPAP is always effective, and b) machine-algorithm pressure choice is equivalent to polysomnographic titration.  The first assumption ignores the role of interacting obstructive and chemoreflex phenotypes in sleep apnea pathogenesis.  Although the second assumption is supported in carefully selected populations, we do not have independent confirmation in this vascular population.  Like at-home diagnostic kits, at-home auto-titration is assumed, but not validated clinically, to detect central apnea.  In fact, detecting pauses of any kind may not be as accurate as often assumed: we recently showed that manual scoring of machine waveforms reveals significantly higher event indices than automated scoring (5).  Arguably the most important treatment-related finding was that CPAP compliance averaged only 3.3 hours per night, and only ~42% of participants averaged more than 4 hours per night.  Was this enough to expect risk mitigation at all, let alone the large risk differential upon which the trial was powered?  Although subjective endpoints of mood and the Epworth Scale improved, one of the most commonly expected physiological endpoints, blood pressure reduction, was not observed.  A careful subset analysis using propensity-score matching for those with >4 hours/night of usage was also negative for vascular benefit, but this subset is expected to be under-powered.  Finally, although not measured in any clinical trial to date, the “apnea burden” also contributes variance, as off-PAP sleep time may contain significant disease (6,7).  When sleep apnea recurs during off-PAP sleep time – and even a 4-hour-per-night user may be treated for only about half of total sleep time – this too may contribute to incomplete treatment in the CPAP group.
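   The “apnea burden” point can be made concrete with a small back-of-the-envelope sketch.  The event rates below are hypothetical (they are not trial data); the only number taken from the discussion above is the roughly 4 hours of nightly PAP use.

    # Illustrative 'apnea burden' arithmetic: how much disease remains when PAP
    # covers only part of the night?  Event rates are hypothetical assumptions.
    total_sleep_hours = 8.0
    pap_hours = 4.0                  # roughly the usage level discussed above
    untreated_ahi = 30.0             # hypothetical events/hour off PAP
    residual_ahi_on_pap = 2.0        # hypothetical events/hour on PAP

    treated_fraction = pap_hours / total_sleep_hours
    effective_ahi = (pap_hours * residual_ahi_on_pap
                     + (total_sleep_hours - pap_hours) * untreated_ahi) / total_sleep_hours
    print(f"Fraction of sleep treated: {treated_fraction:.0%}")
    print(f"Whole-night effective AHI: {effective_ahi:.0f} events/hour "
          f"(vs {untreated_ahi:.0f} events/hour untreated)")

   Under these assumptions, a “compliant” user still experiences roughly half of the untreated disease burden over the whole night, which is the sense in which partial use can dilute any treatment effect.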

   In summary, we caution against an oversimplified conclusion from a study based on Level 4 diagnosis, absence of PSG at the diagnostic or therapeutic phases, poor average compliance, and other contributors to noise in the data.  We hope that these issues and the ongoing discussions can provide context for such “negative” outcomes, and mitigate the risk of de-prioritizing sleep apnea therapy clinically, or even from a coverage perspective.  The field faces real challenges in determining what the most effective therapy is for simple obstructive disease as well as for phenotypically complex sleep-disordered breathing, what the minimum effective threshold of PAP use is, and how precisely we need to understand therapy effectiveness in order to maximize power for outcomes questions.  We also face the challenge of how research findings, especially when negative or unexpected, are discussed and disseminated.  One could reasonably ask why the response to this trial was not an indictment of at-home methods for managing sleep apnea.  Of course, the trial was not designed to answer the in-lab versus at-home question, but when strong concerns exist even for the questions it was designed to answer, we are obliged to think critically about how we can do better for our patients, and about how we as a field choose to discuss research findings with the very audience who depend on us to curate biomedical knowledge for their benefit.

Contributors:  Matt Bianchi MD PhD (MGH), and Robert Thomas MD (BIDMC)

References:
1. McEvoy RD, Antic NA, Heeley E, et al. CPAP for Prevention of Cardiovascular Events in Obstructive Sleep Apnea. N Engl J Med. 2016;375(10):919-931.
2. Walters AS, Rye DB. Review of the relationship of restless legs syndrome and periodic limb movements in sleep to hypertension, heart disease, and stroke. Sleep. 2009;32(5):589-597.
3. Collop NA, Tracy SL, Kapur V, et al. Obstructive sleep apnea devices for out-of-center (OOC) testing: technology evaluation. J Clin Sleep Med. 2011;7(5):531-548.
4. Collop NA, Anderson WM, Boehlecke B, et al. Clinical guidelines for the use of unattended portable monitors in the diagnosis of obstructive sleep apnea in adult patients. Portable Monitoring Task Force of the American Academy of Sleep Medicine. J Clin Sleep Med. 2007;3(7):737-747.
5. Reiter J, Zleik B, Bazalakova M, Mehta P, Thomas RJ. Residual Events during Use of CPAP: Prevalence, Predictors, and Detection Accuracy. J Clin Sleep Med. 2016;12(8):1153-1158.
6. Boyd SB, Walters AS. Effectiveness of Treatment Apnea-Hypopnea Index: A Mathematical Estimate of the True Apnea-Hypopnea Index in the Home Setting. J Oral Maxillofac Surg. 2012.
7. Bianchi MT, Alameddine Y, Mojica J. Apnea burden: efficacy versus effectiveness in patients using positive airway pressure. Sleep Med. 2014;15(12):1579-1581.
