Caitlin Reiter, Grand Canyon University
Abstract
New Tool for Identifying Data Artifacts in Serially Collected Crystallographic Data
Caitlin M. Reiter (3), Gihan Ketawala (1,2), and Sabine Botha (1,3)*
(1) BIODESIGN CENTER FOR APPLIED STRUCTURAL DISCOVERY, ARIZONA STATE UNIVERSITY, TEMPE, AZ 85287-5001 USA, (2) SCHOOL OF MOLECULAR SCIENCES,
ARIZONA STATE UNIVERSITY, TEMPE, AZ 85287-1604 USA, AND (3) DEPARTMENT OF PHYSICS, ARIZONA STATE UNIVERSITY, TEMPE, AZ 85287-1504, USA.
E-MAIL: SBOTHA@ASU.EDU
Serial femtosecond crystallography (SFX) has revolutionized structure determination under ambient conditions. Employing the concept of "diffraction before destruction", micrometer-sized crystals can be probed at near-physiological conditions before the onset of structure-altering radiation damage. This is achieved by exposing thousands of crystals, each only once, to ultra-bright, ultra-short X-ray pulses in a serial data collection approach. However, data collected in this way are often plagued by stochastic inconsistencies arising from the crystal delivery mechanism, the detector, or fluctuations of the X-ray source or experimental end station. Here, we report a new data sorting tool that offers a variety of machine learning (ML) algorithms for sorting data, trained either on manual sorting by the user or on profile fitting of the expected intensity distribution on the detector. The tool is integrated into an easy-to-use graphical user interface (GUI) specifically designed to support the detectors, file formats, and software available at popular X-ray free-electron lasers (XFELs). Four different ML algorithms are implemented in the new data sorting interface, and their results are presented to highlight improved statistics with lower data redundancy. The test data are from the SARS-CoV-2 NendoU protein, collected on the EPIX10k-2M detector [1].
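The idea of training a frame sorter on labels derived from profile fitting can be illustrated with a minimal sketch. It is not the tool described here: the function names, the tolerance `tol`, the synthetic radial profiles, and the choice of a nearest-centroid classifier are all assumptions made for illustration, and the four algorithms offered by the tool are not specified in this abstract.

```python
import math

def normalize(profile):
    """Scale a radial intensity profile so its bins sum to 1."""
    total = sum(profile)
    return [v / total for v in profile]

def label_by_profile_fit(profile, expected, tol=0.2):
    """Label a frame by least-squares scaling of the expected profile.

    Hypothetical criterion: the frame is 'good' if the RMS residual of the
    fit is at most `tol` times the mean of the fitted profile.
    """
    scale = sum(p * e for p, e in zip(profile, expected)) / sum(e * e for e in expected)
    residual_sq = sum((p - scale * e) ** 2 for p, e in zip(profile, expected))
    rms = math.sqrt(residual_sq / len(expected))
    mean_fit = scale * sum(expected) / len(expected)
    return "good" if rms <= tol * mean_fit else "artifact"

def train_centroids(profiles, labels):
    """Average the profiles of each class (nearest-centroid training)."""
    sums, counts = {}, {}
    for prof, lab in zip(profiles, labels):
        acc = sums.setdefault(lab, [0.0] * len(prof))
        for i, v in enumerate(prof):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def classify(profile, centroids):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda lab: math.dist(profile, centroids[lab]))

# Synthetic data (assumed): an exponentially decaying expected radial profile,
# three rescaled copies of it, and two frames with a flat added background.
expected = [100.0 * math.exp(-0.2 * r) for r in range(10)]
training = [[s * e for e in expected] for s in (0.8, 1.0, 1.2)]
training += [[e + b for e in expected] for b in (50.0, 80.0)]

# Labels come from the profile fit, then the classifier is trained on them.
labels = [label_by_profile_fit(frame, expected) for frame in training]
centroids = train_centroids([normalize(f) for f in training], labels)

# Sort two unseen frames using the trained centroids.
new_frames = {
    "scaled": [0.9 * e for e in expected],
    "background": [e + 60.0 for e in expected],
}
predictions = {name: classify(normalize(f), centroids) for name, f in new_frames.items()}
print(predictions)
```

Normalizing each profile before training makes the classifier insensitive to overall intensity, so a frame that differs from the expected profile only by scale is not flagged. Labels supplied manually by the user could replace the output of `label_by_profile_fit` without changing the rest of the sketch.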
References
1. Jernigan RJ et al. Structure. 2023 Feb 2;31(2):138-151.e5. doi: 10.1016/j.str.2022.12.009.