
Informative Missingness in Nominal Data: A Graph-Theoretic Approach to Revealing Hidden Structure

Zangene, E.; Schwammle, V.; Jafari, M.

2026-02-03 bioinformatics
10.1101/2025.08.22.670516 bioRxiv

Missing data is often treated as a nuisance, routinely imputed or excluded from statistical analyses, especially in nominal datasets where its structure cannot be easily modeled. However, the form of missingness itself can reveal hidden relationships, substructures, and biological or operational constraints within a dataset. In this study, we present a graph-theoretic approach that reinterprets missing values not as gaps to be filled, but as informative signals. By representing nominal variables as nodes and encoding observed or missing associations as edges, we construct both weighted and unweighted bipartite graphs to analyze modularity, nestedness, and projection-based similarities. This framework enables downstream clustering and structural characterization of nominal data based on the topology of observed and missing associations; edge prediction via multiple imputation strategies is included as an optional downstream analysis to evaluate how well inferred values preserve the structure identified in the non-missing data. Across a series of biological, ecological, and social case studies, including proteomics data, the BeatAML drug screening dataset, ecological pollination networks, and HR analytics, we demonstrate that the structure of missing values can be highly informative. These configurations often reflect meaningful constraints and latent substructures, providing signals that help distinguish between data missing at random and not at random. When analyzed with appropriate graph-based tools, these patterns can be leveraged to improve the structural understanding of data and provide complementary signals for downstream tasks such as clustering and similarity analysis. Our findings support a conceptual shift: missing values are not merely analytical obstacles but valuable sources of insight that, when properly modeled, can enrich our understanding of complex nominal systems across domains. 
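To make the core idea of the abstract concrete, here is a minimal sketch of encoding missing associations between two nominal variables as a bipartite graph and projecting it onto one node set. It is not the authors' implementation: the toy records, the function names `bipartite_edges` and `jaccard_projection`, and the choice of Jaccard overlap as the projection similarity are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy records: (sample, drug, response); None marks a missing value.
records = [
    ("s1", "drugA", 0.9), ("s1", "drugB", 0.4), ("s1", "drugC", None),
    ("s2", "drugA", 0.7), ("s2", "drugB", 0.5), ("s2", "drugC", None),
    ("s3", "drugA", None), ("s3", "drugB", None), ("s3", "drugC", 0.8),
]

def bipartite_edges(records, keep_missing):
    """Map each row label to the set of column labels it is linked to.

    With keep_missing=True an edge means "value is missing";
    with keep_missing=False an edge means "value was observed".
    """
    adj = {}
    for row, col, value in records:
        adj.setdefault(row, set())
        if (value is None) == keep_missing:
            adj[row].add(col)
    return adj

def jaccard_projection(adj):
    """One-mode projection onto row nodes, weighted by Jaccard overlap of neighbours."""
    sims = {}
    for a, b in combinations(sorted(adj), 2):
        union = adj[a] | adj[b]
        sims[(a, b)] = len(adj[a] & adj[b]) / len(union) if union else 0.0
    return sims

# Graph of missing associations and the sample-sample similarity it induces.
missing_graph = bipartite_edges(records, keep_missing=True)
missing_similarity = jaccard_projection(missing_graph)
print(missing_similarity)
```

In this toy example, samples s1 and s2 share exactly the same missing pattern, while s3 has a disjoint one, so a clustering on the projected similarities would separate s3 from the other two purely from where values are absent.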
Figure 1. Shiny app address: https://ehsan-zangene.shinyapps.io/nimaa_app/

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1 | Patterns | 70 papers in training set | Top 0.1% | 10.4%
2 | Cell Systems | 167 papers in training set | Top 1% | 10.1%
3 | Nature Methods | 336 papers in training set | Top 1% | 9.1%
4 | Bioinformatics | 1061 papers in training set | Top 3% | 8.4%
5 | Nature Communications | 4913 papers in training set | Top 26% | 6.8%
6 | Proceedings of the National Academy of Sciences | 2130 papers in training set | Top 11% | 6.4%
50% of probability mass above
7 | PLOS Computational Biology | 1633 papers in training set | Top 6% | 6.4%
8 | Nature Biotechnology | 147 papers in training set | Top 3% | 2.4%
9 | Bioinformatics Advances | 184 papers in training set | Top 2% | 2.4%
10 | Journal of Proteome Research | 215 papers in training set | Top 1% | 2.1%
11 | Genome Biology | 555 papers in training set | Top 4% | 1.9%
12 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 4% | 1.9%
13 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 2% | 1.7%
14 | Molecular & Cellular Proteomics | 158 papers in training set | Top 1% | 1.7%
15 | Molecular Systems Biology | 142 papers in training set | Top 0.8% | 1.5%
16 | Cell | 370 papers in training set | Top 13% | 1.3%
17 | eLife | 5422 papers in training set | Top 47% | 1.3%
18 | Briefings in Bioinformatics | 326 papers in training set | Top 5% | 1.3%
19 | Advanced Science | 249 papers in training set | Top 14% | 1.2%
20 | PLOS ONE | 4510 papers in training set | Top 60% | 1.2%
21 | Journal of Molecular Biology | 217 papers in training set | Top 2% | 1.2%
22 | Nucleic Acids Research | 1128 papers in training set | Top 14% | 1.1%
23 | Cell Reports Methods | 141 papers in training set | Top 4% | 0.9%
24 | Genome Research | 409 papers in training set | Top 4% | 0.8%
25 | iScience | 1063 papers in training set | Top 29% | 0.8%
26 | Nature Machine Intelligence | 61 papers in training set | Top 3% | 0.8%
27 | Protein Science | 221 papers in training set | Top 2% | 0.7%
28 | Scientific Reports | 3102 papers in training set | Top 75% | 0.7%
29 | Genomics, Proteomics & Bioinformatics | 171 papers in training set | Top 7% | 0.7%
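
The statement that the top 6 journals account for 50% of the predicted probability mass can be checked with a short cumulative sum. This is only an arithmetic sketch; the variable names are illustrative, and the values are the percentages listed above for ranks 1 to 7.

```python
from itertools import accumulate

# Predicted probabilities (percent) for ranks 1 to 7, copied from the list above.
probs = [10.4, 10.1, 9.1, 8.4, 6.8, 6.4, 6.4]

# Running total of probability mass by rank.
cumulative = list(accumulate(probs))

# Smallest rank whose cumulative probability reaches 50%.
rank_at_half = next(i for i, c in enumerate(cumulative, start=1) if c >= 50.0)
print(rank_at_half, round(cumulative[rank_at_half - 1], 1))
```

The cumulative mass is about 44.8% after rank 5 and about 51.2% after rank 6, which is where the 50% marker sits in the list.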