What makes Fish-AIR special?

•Curation: The Fish-AIR dataset draws multimedia objects from several repositories and from the outputs of different pipelines. What sets it apart is the set of features we developed and maintained while working with machine learning scientists. One of these is curation of both the data and the metadata: we check every multimedia object for file system-related and biological information. This information is described in the Vocabulary section.

•ARK: Fish-AIR uses the Archival Resource Key (ARK) system to assign an identifier to each multimedia object. The same identifier is used to name the multimedia file and to trace progeny information, which gives the dataset a strong provenance feature.

•AI-Readiness: Data has little value on its own without metadata. The diversity and number of terms used to describe the data, the FAIRness of the data, and the amount of data carpentry required before analysis together determine the AI-readiness of the data. Some metadata describe the quality of the data according to defined quality metrics; others describe the source and background of the data, such as the collection event or batch information.
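
The ARK-based file naming mentioned above can be illustrated with a minimal sketch. It assumes the general ARK form ark:/NAAN/Name (the newer ark:NAAN/Name form is also accepted). The NAAN 99999, the example name, the helper names parse_ark and ark_to_filename, and the rule "file name = ARK name + extension" are all hypothetical and are not taken from the Fish-AIR implementation.

```python
import re

# Simplified ARK pattern: "ark:" + optional "/" + numeric NAAN + "/" + name.
# Real ARKs may carry qualifiers after the name; this sketch ignores them.
ARK_PATTERN = re.compile(r"^ark:/?(?P<naan>\d+)/(?P<name>[^/?#\s]+)$")


def parse_ark(ark: str) -> tuple:
    """Split an ARK string into (naan, name); raise ValueError if malformed."""
    match = ARK_PATTERN.match(ark.strip())
    if match is None:
        raise ValueError(f"not a recognizable ARK: {ark!r}")
    return match.group("naan"), match.group("name")


def ark_to_filename(ark: str, extension: str = ".jpg") -> str:
    """Hypothetical naming rule: reuse the ARK name as the file stem."""
    _naan, name = parse_ark(ark)
    return f"{name}{extension}"


# 99999 is the NAAN reserved for test ARKs; the name is made up.
example_ark = "ark:/99999/fk4abc123"
print(parse_ark(example_ark))
print(ark_to_filename(example_ark))
```

Because the file name is derived from the identifier, a file found on disk can be mapped back to its metadata record without a separate lookup table.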

Components of the dataset

•Data: The Fish-AIR dataset currently consists of images of preserved fish specimens held in biocollections. The images are harvested from several data repositories and also include the products of image processing and machine learning pipelines built for different purposes. The repositories used for harvesting are GLIN (with five main institutions: INHS, OSUM, UMMZ, FMNH, UWZM), iDigBio, GBIF, and Morphbank. The processed images are outputs of pipelines such as bounding-box detection, segmentation, and trait extraction. The number of repositories, institutions, and pipeline types will increase as the dataset grows.

•Metadata: Metadata is an indispensable part of every dataset, and its richness determines the FAIRness and AI-readiness of the data. We developed our metadata terms as the need arose during our collaboration with ML data consumers, adopting terms from existing formats whenever available and creating new ones otherwise (see the Vocabulary section). Metadata supports finding and filtering the data, interoperability, and reusability, which is what makes the data FAIR.
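
As a rough illustration of the finding and filtering that metadata enables, the sketch below selects records by metadata values. The field names (ark_id, institution, image_type, quality_ok), the sample records, and the filter_records helper are hypothetical; the actual Fish-AIR terms are those defined in the Vocabulary section.

```python
# Hypothetical metadata records; identifiers and values are made up.
records = [
    {"ark_id": "ark:/99999/fk4aaa", "institution": "INHS",
     "image_type": "original", "quality_ok": True},
    {"ark_id": "ark:/99999/fk4bbb", "institution": "INHS",
     "image_type": "segmentation", "quality_ok": False},
    {"ark_id": "ark:/99999/fk4ccc", "institution": "UMMZ",
     "image_type": "original", "quality_ok": True},
    {"ark_id": "ark:/99999/fk4ddd", "institution": "INHS",
     "image_type": "segmentation", "quality_ok": True},
]


def filter_records(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# Select INHS images that passed the (hypothetical) quality check.
usable = filter_records(records, institution="INHS", quality_ok=True)
for record in usable:
    print(record["ark_id"], record["image_type"])
```

The richer the metadata, the more such selections a data consumer can make without opening or inspecting the image files themselves.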