Page 61 - JSOM Spring 2026
Methods

Study Design
A pilot-prospective, fully crossed MRMC diagnostic accuracy study was conducted to quantify the effect of AI assistance on LUS interpretation. Five active-duty United States Marine Corps hospital Corpsmen served as readers. None of the readers had formal LUS training, but all received a focused 5-minute teaching session to learn about LUS and lung sliding, including practice interpretation on two clips not included in the study set. The definitions provided during the training are detailed in Appendix 1.

Each reader then interpreted the same 50 de-identified LUS clips twice in two separate sessions: once without AI assistance (unassisted condition) and once with a binary AI prediction displayed alongside the clip (assisted condition). Condition order was computer-randomized in blocks, and clip order was independently randomized for each reader in every session. A minimum washout period of 2 hours separated the sessions to reduce recall bias.

Setting
All reading sessions were embedded within a scheduled Marine Corps field exercise to reflect operational conditions. Readers wore full tactical gear and worked under natural daylight, which introduced screen glare. Ambient noise, time constraints, and concurrent drill tasks were preserved to simulate the cognitive demands of an operational training environment.

AI Model Development and Validation
ATLAS is a vendor-agnostic lung sliding classification algorithm developed by Deep Breathe Inc. It was trained on a diverse dataset comprising 6,266 LUS clips collected from multiple clinical centers across Canada, the United States, and Chile. The dataset reflects a broad range of scanning conditions and image characteristics, including clips acquired using ultrasound systems from several vendors including Sonosite, Mindray, Philips, Butterfly, and GE. All training annotations were assigned by a clinical researcher with LUS training (truther). To promote consistency and ensure that challenging cases informed model development, the truther flagged challenging clips for secondary review by a Canadian board-certified critical care and emergency physician with more than 20 years of LUS experience (reviewer).

Prior to this pilot study, the standalone performance of ATLAS was evaluated on an independent holdout set of 786 LUS clips. Like the training set, this validation set featured data from a range of probe vendors and diverse imaging conditions and was specifically sourced from geographically distinct sites not included in training. This design ensured that performance estimates were both unbiased and representative of real-world clinical variability. Three U.S. board-certified emergency physicians, who were not involved in the training set labeling, assigned the validation set annotations by consensus. On this holdout set, ATLAS achieved a sensitivity of 0.94 (95% CI 0.90–0.97) and a specificity of 0.88 (95% CI 0.85–0.90) for detecting absent lung sliding.
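
For readers less familiar with how standalone sensitivity and specificity estimates of this kind are derived, the following is a minimal sketch with absent lung sliding as the positive class. The confusion-matrix counts are invented for illustration and are not the study's data. The Wilson score interval is an assumption, since the text does not state which confidence interval method was used, and the function names are hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%).

    Assumption: the source does not say which CI method was used.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Point estimates and Wilson intervals.

    Positive class = absent lung sliding; negative class = sliding present.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


# Hypothetical counts, NOT taken from the study.
example = sensitivity_specificity(tp=90, fn=10, tn=180, fp=20)
print(example)
```

With real data, the four counts would come from comparing the algorithm's binary output with the consensus annotations on the holdout clips.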

Participants
Participants were a convenience sample of five active-duty United States Marine Corps Corpsmen. None of the participants had any formal training in POCUS prior to the study. While the lack of POCUS experience was not a deliberate selection criterion, it reflects the reality that many Corpsmen in operational environments have little or no prior exposure to ultrasound.

Study Procedures and Clip Interpretation
Fifty B-mode LUS clips, each lasting three to seven seconds, were randomly sampled from a research repository maintained by Deep Breathe Inc. Ground truth labels indicating whether lung sliding was present or absent were established through consensus review by three board-certified intensivists. An equal number of clips demonstrating lung sliding present and absent were labelled, resulting in a balanced dataset of 25 clips in each category.

In the unassisted session, readers viewed the clips in a randomized order and recorded: 1) a binary interpretation of lung sliding presence or absence and 2) a confidence rating using a five-point scale (Appendix 2).

In the assisted session, readers completed the same task with the same set of clips in re-randomized order with the addition of the AI tool’s binary output shown above each video. The AI model, called ATLAS (developed by Deep Breathe Inc., London, Ontario), is a binary classifier trained to identify the presence or absence of lung sliding. Readers were informed that the tool was an automated aid, but they received no additional instruction or training. Participants were not allowed to revisit previous responses, and no feedback was provided between or during the sessions.

Data Analysis
To quantify diagnostic certainty for AUROC analysis, each reader’s binary interpretation was combined with their confidence rating to generate a certainty index ranging from 0 to 10. Lung sliding present was defined as the negative class. Scores at the lower end of the scale reflected increasing confidence that lung sliding was present, with 0 representing a very confident assessment that sliding was present and 10 representing high confidence that lung sliding was absent. This approach allowed the binary decision and associated confidence to be mapped onto a continuous spectrum for receiver operating characteristic (ROC) analysis.

The study employed a fully crossed, paired design in which each reader evaluated every case under both assisted and unassisted conditions. The primary endpoint was the difference in AUROC between conditions using the random-reader, random-case (RRRC) model. Secondary outcomes included differences in sensitivity, specificity, and accuracy, which were evaluated for statistical significance using McNemar’s test. Confidence score distributions across the certainty index were compared using the paired Stuart-Maxwell marginal homogeneity χ² test. Two-sided P values <.05 were considered statistically significant. All statistical analyses were performed using Python, with the exception of the RRRC analysis which was performed using R (version 4.4.0).

Results
Participants demonstrated significantly improved diagnostic performance when supported by AI compared to their unassisted interpretations. The mean AUROC increased from 0.715 (95% CI 0.519–0.910) to 0.925 (95% CI 0.864–0.986), which was statistically significant under the RRRC framework (P=.032) (Figure 1). Other diagnostic metrics showed similar

