American Journal of Data Science and Machine Learning

Open Access Peer Review International

Architecting FAIR Digital Objects and Computational Workflows: Interoperable Metadata, Persistent Identification, and Reproducible Research Infrastructures

Department of Information Studies, University of Copenhagen, Denmark

Abstract

The exponential growth of data-intensive research across the life sciences, Earth sciences, and computational domains has exposed profound challenges in the interoperability, reproducibility, and long-term stewardship of digital research assets. The FAIR principles provide a normative framework for making digital objects Findable, Accessible, Interoperable, and Reusable. Operationalizing FAIR at scale, however, requires coordinated infrastructures that integrate persistent identifiers, semantic web standards, research object packaging, workflow provenance capture, and policy-aligned governance mechanisms. This article presents a theoretical and architectural synthesis of interoperable FAIR digital objects and computational workflows, grounded in contemporary specifications and community-driven implementations. Drawing on standards such as the Digital Object Interface Protocol specification, PROV-O, RDF 1.1, Schema.org, OCFL, BagIt, IEEE 2791-2020, and RO-Crate, and on community platforms including myExperiment, Whole Tale, OpenAIRE, the NCI Genomic Data Commons, and the EOSC Interoperability Framework, the study constructs a layered conceptual model for digital object management. The analysis examines metadata modeling through Bioschemas and Science on Schema.org, ontology reuse, machine-actionable data management plans, and persistent identifier design patterns. Particular emphasis is placed on computational workflow reproducibility using engines such as Snakemake and Galaxy, software distribution ecosystems such as Bioconda, cross-platform packaging in Debian, and provenance frameworks including CWLProv and Pegasus. The article critically interrogates the socio-technical tensions between researcher usability and stewardship compliance, drawing on debates about data management fatigue and lifestyle-oriented FAIR practice. It advances a detailed interoperability architecture that integrates digital object identifiers, research object crates, linked data graphs, and repository storage layouts compliant with OCFL and BagIt. Through theoretical elaboration, the paper articulates governance principles for data commons, cross-domain metadata harmonization, and standards-based international genomic data sharing. The findings indicate that sustainable FAIR infrastructures emerge not from isolated tools but from coordinated ecosystems that combine persistent identity, semantic richness, workflow transparency, and institutional stewardship cultures. The study concludes by outlining policy, technical, and cultural pathways toward machine-actionable, reproducible, and globally interoperable research environments.
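As an illustration of how a research object crate can tie a persistent identifier to a linked data graph, the following minimal Python sketch assembles a metadata document in the style of RO-Crate 1.1. It is not the article's own implementation. The helper name `build_crate`, the file paths, and the identifier are hypothetical placeholders chosen for the example; only the `@context` URL, the `ro-crate-metadata.json` descriptor, and the `./` root dataset convention follow the published RO-Crate 1.1 specification.

```python
import json


def build_crate(name, description, identifier, files):
    """Return an RO-Crate 1.1 style metadata document as a plain dict.

    `files` is a list of (relative_path, label) pairs. All values passed
    in are illustrative assumptions, not data from the article.
    """
    # One contextual entity per packaged file.
    file_entities = [
        {"@id": path, "@type": "File", "name": label}
        for path, label in files
    ]
    # Root dataset: carries the persistent identifier and lists its parts.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": identifier,
        "hasPart": [{"@id": entity["@id"]} for entity in file_entities],
    }
    # Metadata descriptor: declares which specification the crate follows.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *file_entities],
    }


crate = build_crate(
    name="Example workflow run",
    description="Hypothetical crate bundling a workflow and its input.",
    identifier="https://doi.org/10.1234/example",  # placeholder, not a real DOI
    files=[
        ("workflow/Snakefile", "Snakemake workflow definition"),
        ("data/input.csv", "Input table"),
    ],
)

# Serialized, this is what would be saved as ro-crate-metadata.json.
print(json.dumps(crate, indent=2))
```

Because the result is ordinary JSON-LD, the same file could in principle be placed inside a BagIt bag or an OCFL object version without modification, which is one way to read the layered architecture the abstract describes.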
