Poster Session 2
Aris T. Papageorghiou, FRCOG, MBBCH, MD
Professor of Fetal Medicine
University of Oxford
Oxford, England, United Kingdom
J Alison Noble, PhD
Professor of Biomedical Engineering
University of Oxford
Oxford, England, United Kingdom
Olga Patey, MD
University of Oxford
Oxford, England, United Kingdom
Mostafa Sarker
University of Oxford
Oxford, England, United Kingdom
Netzahualcoyotl Hernandez-Cruz
University of Oxford
Oxford, England, United Kingdom
Divyanshu Mishra
University of Oxford
Oxford, England, United Kingdom
Beverly Tsai-Goodman
Royal Brompton Hospital
London, England, United Kingdom
When building machine learning algorithms collaboratively, annotating ultrasound videos of fetal heart "sweeps" is an important prerequisite. These annotations are undertaken by experts following standard guidelines and criteria. To ensure that these data are reliable for training machine learning algorithms, we aimed to create a system to quantify inter- and intra-annotator variation. The ultimate aim is to create quality assurance systems for video annotation that reduce annotator bias.
Study Design:
Two experienced cardiologists (A1 and A2) each annotated 4,539 individual ultrasound frames from ten fetal heart ultrasound videos (transverse sweeps). Each frame was manually assigned one of five fetal cardiac labels. Pairwise comparisons were made between annotations conducted two weeks apart by A1 (intra-annotator agreement) and between cardiologists A1 and A2 (inter-annotator agreement). Inter- and intra-annotator agreement was quantified on a frame-by-frame basis using the intraclass correlation coefficient (ICC) and kappa scores.
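For context, frame-by-frame agreement of this kind can be computed with standard tools. The Python sketch below is illustrative only and is not the authors' actual pipeline; the five label names and the example sequences are hypothetical, and Cohen's kappa (via scikit-learn) is shown as one common choice of kappa statistic.

    # Illustrative sketch: frame-by-frame agreement between two annotators'
    # categorical labels. Label names and example data are hypothetical.
    from sklearn.metrics import cohen_kappa_score

    # Five hypothetical fetal cardiac view labels, one per ultrasound frame.
    LABELS = ["4CH", "LVOT", "RVOT", "3VV", "background"]

    def frame_agreement(annotations_a, annotations_b):
        """Return (percent agreement, Cohen's kappa) for two equal-length label lists."""
        assert len(annotations_a) == len(annotations_b)
        matches = sum(a == b for a, b in zip(annotations_a, annotations_b))
        percent = matches / len(annotations_a)
        kappa = cohen_kappa_score(annotations_a, annotations_b, labels=LABELS)
        return percent, kappa

    # Example: intra-annotator comparison (A1 vs. A1 two weeks later) on toy data.
    a1_first = ["4CH", "4CH", "LVOT", "3VV", "background"]
    a1_repeat = ["4CH", "LVOT", "LVOT", "3VV", "background"]
    print(frame_agreement(a1_first, a1_repeat))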
Results:
All frames were successfully labelled. Intra-annotator reproducibility showed strong agreement (ICC = 74.9%, kappa = 81%). Inter-annotator agreement between A1 and A2 was moderate (ICC = 66.6%, kappa = 60%).
Conclusion:
The results show high reproducibility of frame-level annotations of the fetal heart from ultrasound sweeps for a single annotator. However, agreement between different annotators is less consistent, reaching only moderate levels. This variation has implications for quality assurance and for what constitutes "ground truth" when fetal heart annotations of ultrasound videos are used to train machine learning algorithms.