Page 61 - JSOM Spring 2026

Methods

Study Design
A pilot-prospective, fully crossed MRMC diagnostic accuracy study was conducted to quantify the effect of AI assistance on LUS interpretation. Five active-duty United States Marine Corps hospital Corpsmen served as readers. None of the readers had formal LUS training, but all received a focused 5-minute teaching session to learn about LUS and lung sliding, including practice interpretation on two clips not included in the study set. The definitions provided during the training are detailed in Appendix 1.

Each reader then interpreted the same 50 de-identified LUS clips twice in two separate sessions: once without AI assistance (unassisted condition) and once with a binary AI prediction displayed alongside the clip (assisted condition). Condition order was computer-randomized in blocks, and clip order was independently randomized for each reader in every session. A minimum washout period of 2 hours separated the sessions to reduce recall bias.

Setting
All reading sessions were embedded within a scheduled Marine Corps field exercise to reflect operational conditions. Readers wore full tactical gear and worked under natural daylight, which introduced screen glare. Ambient noise, time constraints, and concurrent drill tasks were preserved to simulate the cognitive demands of an operational training environment.

AI Model Development and Validation
ATLAS is a vendor-agnostic lung sliding classification algorithm developed by Deep Breathe Inc. It was trained on a diverse dataset comprising 6,266 LUS clips collected from multiple clinical centers across Canada, the United States, and Chile. The dataset reflects a broad range of scanning conditions and image characteristics, including clips acquired using ultrasound systems from several vendors including Sonosite, Mindray, Philips, Butterfly, and GE. All training annotations were assigned by a clinical researcher with LUS training (truther). To promote consistency and ensure that challenging cases informed model development, the truther flagged challenging clips for secondary review by a Canadian board-certified critical care and emergency physician with more than 20 years of LUS experience (reviewer).

Prior to this pilot study, the standalone performance of ATLAS was evaluated on an independent holdout set of 786 LUS clips. Like the training set, this validation set featured data from a range of probe vendors and diverse imaging conditions and was specifically sourced from geographically distinct sites not included in training. This design ensured that performance estimates were both unbiased and representative of real-world clinical variability. Three U.S. board-certified emergency physicians, who were not involved in the training set labeling, assigned the validation set annotations by consensus. On this holdout set, ATLAS achieved a sensitivity of 0.94 (95% CI 0.90–0.97) and a specificity of 0.88 (95% CI 0.85–0.90) for detecting absent lung sliding.
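
As an illustration of how sensitivity and specificity with 95% confidence intervals, such as those reported for the holdout set, can be computed when absent lung sliding is the positive class, a minimal Python sketch follows. The text does not state which interval method was used, so the Wilson score interval here is an assumption. The function names (`wilson_ci`, `sens_spec`) and the toy labels are hypothetical and are not study data.

```python
import math


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def sens_spec(truth, pred, positive="absent"):
    """Sensitivity and specificity, each paired with a Wilson interval.

    `positive` names the positive class; absent lung sliding is treated
    as positive, matching the convention described in the text.
    """
    tp = fn = tn = fp = 0
    for t, p in zip(truth, pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one clip from each class")
    return {
        "sensitivity": (tp / (tp + fn), wilson_ci(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_ci(tn, tn + fp)),
    }


# Toy labels for illustration only (NOT the study's holdout data).
truth = ["absent"] * 10 + ["present"] * 10
pred = ["absent"] * 9 + ["present"] + ["absent"] * 2 + ["present"] * 8

result = sens_spec(truth, pred)
for name, (estimate, (low, high)) in result.items():
    print(f"{name}: {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

With only 10 clips per class the intervals are wide. On a set the size of the 786-clip holdout set, the same calculation would give much narrower intervals.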
Participants
Participants were a convenience sample of five active-duty United States Marine Corps Corpsmen. None of the participants had any formal training in POCUS prior to the study. While the lack of POCUS experience was not a deliberate selection criterion, it reflects the reality that many Corpsmen in operational environments have little or no prior exposure to ultrasound.

Study Procedures and Clip Interpretation
Fifty B-mode LUS clips, each lasting three to seven seconds, were randomly sampled from a research repository maintained by Deep Breathe Inc. Ground truth labels indicating whether lung sliding was present or absent were established through consensus review by three board-certified intensivists. An equal number of clips demonstrating lung sliding present and absent were labelled, resulting in a balanced dataset of 25 clips in each category.

In the unassisted session, readers viewed the clips in a randomized order and recorded: 1) a binary interpretation of lung sliding presence or absence and 2) a confidence rating using a five-point scale (Appendix 2).

In the assisted session, readers completed the same task with the same set of clips in re-randomized order with the addition of the AI tool’s binary output shown above each video. The AI model, called ATLAS (developed by Deep Breathe Inc., London, Ontario), is a binary classifier trained to identify the presence or absence of lung sliding. Readers were informed that the tool was an automated aid, but they received no additional instruction or training. Participants were not allowed to revisit previous responses, and no feedback was provided between or during the sessions.

Data Analysis
To quantify diagnostic certainty for AUROC analysis, each reader’s binary interpretation was combined with their confidence rating to generate a certainty index ranging from 0 to 10. Lung sliding present was defined as the negative class. Scores at the lower end of the scale reflected increasing confidence that lung sliding was present, with 0 representing a very confident assessment that sliding was present and 10 representing high confidence that lung sliding was absent. This approach allowed the binary decision and associated confidence to be mapped onto a continuous spectrum for receiver operating characteristic (ROC) analysis.

The study employed a fully crossed, paired design in which each reader evaluated every case under both assisted and unassisted conditions. The primary endpoint was the difference in AUROC between conditions using the random-reader, random-case (RRRC) model. Secondary outcomes included differences in sensitivity, specificity, and accuracy, which were evaluated for statistical significance using McNemar’s test. Confidence score distributions across the certainty index were compared using the paired Stuart-Maxwell marginal homogeneity χ² test. Two-sided P values <.05 were considered statistically significant. All statistical analyses were performed using Python, with the exception of the RRRC analysis which was performed using R (version 4.4.0).

Results
Participants demonstrated significantly improved diagnostic performance when supported by AI compared to their unassisted interpretations. The mean AUROC increased from 0.715 (95% CI 0.519–0.910) to 0.925 (95% CI 0.864–0.986), which was statistically significant under the RRRC framework (P=.032) (Figure 1). Other diagnostic metrics showed similar

                                                                                    AI-Assisted Lung Sliding Detection  |  59