
Impact of Regularization Methods and Outlier Removal on Unsupervised Sample Classification

Heckman, C. A.

2026-04-10 · bioRxiv (bioinformatics) · doi:10.64898/2026.04.07.716815

Background: High-content assays (HCAs) have problems distinguishing biologically significant effects from the incidental effects of non-repeatable technical factors. Non-repeatable results are attributed to variations in the cell culture environment and to the numerous, heterogeneous descriptors evaluated. The aim here was to determine whether preprocessing operations affected the reproducibility of class assignments of experimental data.

Methods: Batch effects that could affect reproducibility (signal/noise ratio, instrumental conditions, and segmentation) were controlled variables. The remaining batch effects (variations in materials, personnel, and culture environment) could not be controlled. Descriptor values were measured directly from images. Exploratory factor analysis was used to resolve an identifiable and interpretable feature, factor 4. In each of five trials, one sample was treated with the same chemical mixture (EXP) and another with the solvent vehicle alone (CON).

Results: Repeated CON and EXP samples showed significant differences among factor 4 means in data regularized within each trial. The mean of Trial 3 CON differed significantly from all other CON samples. These differences disappeared upon regularization to comprehensive databases. Among repeated EXPs, the Trial 2 mean differed from three other EXPs, but regularization to comprehensive databases had little effect. However, classification patterns were unchanged after regularization to any comprehensive database derived by the same protocol. After regularization to datasets derived by two different protocols, the classification pattern differed, but this reflected only the elevation to statistical significance of differences that had previously been marginal. Outlier removal was deleterious. Even with the most sparing definition of outliers, over 3% of a single sample's contents were removed from most trials. Elimination based on the overall within-trial distributions caused type I and type II errors.

Conclusions: Non-repeatable factor 4 means in repeated trials had negligible influence on classification outcomes, so repeatability may not be a good indicator of assay quality. Irreducible batch effects, combined with small sample sizes and skewed distributions of descriptor values, may account for non-repeatability. As the current results are based on real-world data, they suggest that non-repeatability is an uncorrectable feature of these assays. Classification patterns are not affected by several irreducible technical factors, namely materials, personnel, and non-repeatable environmental variables.
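The hazard of trimming outliers from an overall within-trial distribution can be illustrated with a minimal sketch. The lognormal descriptor, sample size, and symmetric 3-sigma rule below are illustrative assumptions, not the paper's actual protocol: on a skewed distribution, a cutoff computed from the pooled mean and standard deviation flags legitimate right-tail values as outliers (type I errors), while no left-tail value can ever reach the lower cutoff, so genuinely aberrant low values escape detection (type II errors).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed descriptor, e.g. a morphology measure from images.
descriptor = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Symmetric 3-sigma rule on the overall within-trial distribution.
mu, sd = descriptor.mean(), descriptor.std()
lo, hi = mu - 3 * sd, mu + 3 * sd

flagged = (descriptor < lo) | (descriptor > hi)
print(f"flagged {flagged.sum()} of {descriptor.size} legitimate values")

# For a right-skewed descriptor the lower cutoff falls below zero, so a
# symmetric rule can never flag an aberrantly low value.
print(f"lower cutoff = {lo:.2f} (negative on a non-negative descriptor)")
```

Because every flagged point here was drawn from the intended distribution, each removal is a type I error; a robust alternative (e.g. log-transforming before thresholding, or quantile-based fences) changes which points are flagged, which is consistent with the paper's finding that the choice of outlier definition materially affects the retained sample.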

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | PLOS ONE | 4510 | Top 10% | 18.5%
2 | Analytical Chemistry | 205 | Top 0.3% | 9.3%
3 | Scientific Reports | 3102 | Top 17% | 6.5%
4 | BMC Bioinformatics | 383 | Top 2% | 6.5%
5 | SLAS Technology | 11 | Top 0.1% | 6.4%
6 | Microbiology Spectrum | 435 | Top 1% | 2.6%
7 | SLAS Discovery | 25 | Top 0.1% | 2.5%
(50% of probability mass above)
8 | Journal of Immunological Methods | 24 | Top 0.1% | 2.1%
9 | PeerJ | 261 | Top 5% | 1.9%
10 | Analytical and Bioanalytical Chemistry | 17 | Top 0.1% | 1.9%
11 | Analytica Chimica Acta | 17 | Top 0.3% | 1.8%
12 | Science of The Total Environment | 179 | Top 3% | 1.8%
13 | The Analyst | 15 | Top 0.2% | 1.7%
14 | Computational and Structural Biotechnology Journal | 216 | Top 4% | 1.7%
15 | Frontiers in Plant Science | 240 | Top 4% | 1.3%
16 | Limnology and Oceanography: Methods | 11 | Top 0.3% | 1.0%
17 | Nature Communications | 4913 | Top 59% | 0.9%
18 | BMC Methods | 11 | Top 0.1% | 0.9%
19 | Food Chemistry | 12 | Top 0.5% | 0.8%
20 | ACS Omega | 90 | Top 4% | 0.8%
21 | Peer Community Journal | 254 | Top 4% | 0.8%
22 | Frontiers in Medicine | 113 | Top 7% | 0.7%
23 | Biology | 43 | Top 3% | 0.7%
24 | Frontiers in Microbiology | 375 | Top 11% | 0.5%
25 | PLOS Computational Biology | 1633 | Top 29% | 0.5%
26 | Journal of Medical Microbiology | 20 | Top 0.9% | 0.5%
27 | Forensic Science International: Genetics | 24 | Top 0.2% | 0.5%
28 | The Journal of Molecular Diagnostics | 36 | Top 0.6% | 0.5%
29 | Journal of Clinical Virology | 62 | Top 1% | 0.5%
30 | Environmental Science: Water Research & Technology | 13 | Top 0.4% | 0.5%