Biology Guided Neural Network – Tulane Team

A Landing Page

Tulane’s role in the project has been to provide expertise on fish taxonomy, morphology and nomenclature to the other project investigators, and to gather fish specimen images from image repositories for use in the project. We obtained the images either by searching for multimedia files in iDigBio’s portal and downloading them, or by having institutions send us hard drives containing the images. Once the images were gathered, we associated them with any accompanying metadata, including collection event metadata from FishNet2; updated and cleaned the associated taxonomic names using Eschmeyer’s Catalog of Fishes and the World Register of Marine Species (WoRMS); filtered the images to confirm that we were working with the right kinds of images for our studies (lateral views of external surfaces of whole fish specimens); and then employed people to inspect the images manually and record image-quality (IQ) metadata.
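As an illustration of the name-cleaning step described above, here is a minimal sketch in Python. It is not the team's actual code: the function names and the `SYNONYMS` table are hypothetical, and a real table would be built from an authority such as Eschmeyer’s Catalog of Fishes or WoRMS rather than written by hand.

```python
# Hypothetical lookup table mapping outdated names to currently valid ones.
# The single entry is illustrative; a real table would come from a
# taxonomic authority, not be typed in manually.
SYNONYMS = {
    "Notropis cornutus": "Luxilus cornutus",
}


def normalize_name(raw: str) -> str:
    """Collapse extra whitespace and apply Genus-capital, species-lowercase casing."""
    parts = raw.strip().split()
    if not parts:
        return ""
    genus = parts[0].capitalize()
    rest = [p.lower() for p in parts[1:]]
    return " ".join([genus, *rest])


def clean_name(raw: str, synonyms: dict = SYNONYMS) -> str:
    """Normalize a raw name, then replace it if the table lists a valid name."""
    name = normalize_name(raw)
    return synonyms.get(name, name)


if __name__ == "__main__":
    for raw in ["  notropis   CORNUTUS ", "Lepomis macrochirus"]:
        print(repr(raw), "->", clean_name(raw))
```

Names that are not in the table pass through unchanged after normalization, so unresolved names can still be reviewed by a taxonomist later.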

Image Repositories

We started with the iDigBio repository, which is available online. However, the images varied widely among collections and came in many different types, and we needed to provide a clean dataset as soon as possible. Since we had also received images from the GLIN project, we chose to begin working with those.

iDigBio – Integrated Digitized Biocollections

The iDigBio repository is an online national repository that serves data and multimedia files belonging to millions (>40m) of biological specimens held in museums, herbaria and other collections. The multimedia files are mostly images, but they are not all of the same type: CT scans, drawings, X-ray images, skeletons, fossils, specimen labels and many other kinds of images can be found, most of them without any metadata specifying the image type. Taxonomic names are also not always reliable, because the providing institutions do not always curate their data carefully. Another problem is that some of the image files are hosted elsewhere and are not always available for download.

GLIN – Great Lakes Invasives Network

GLIN is a bi-national thematic collections network of >20 institutions from eight states and Canada that digitized >1.7 million biological specimens representing 2,550 species of exotic fish, clams, snails, mussels, algae, plants, and their look-alikes documented to occur in North America's Great Lakes Basin. We contacted six museums and received data from five of them on hard drives. Data quality, number of images, digitization methods and species diversity differed among the museums, but were mostly acceptable.


Metadata Dictionary

Multimedia
Image Quality Metadata

Extended Image Metadata

Batch

The challenges we encountered with the data, especially with the images in the iDigBio dataset, pushed us to create a workflow for generating an AI-ready dataset. This AI-ready dataset includes not only clean images with a uniform background and no color issues, but also a list of the images in a desired format that includes corrected scientific names, file names and URL information, each with a unique ID number.
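To make the image list described above more concrete, here is a minimal sketch in Python of what one row of such a list could look like and how it might be written to CSV. The column names, the `ManifestRow` class and the choice of a URL-derived UUID as the unique ID are all assumptions made for illustration; the project's actual format is not specified here.

```python
import csv
import io
import uuid
from dataclasses import asdict, dataclass, fields


@dataclass
class ManifestRow:
    """One image in the list (hypothetical column names)."""
    image_id: str
    scientific_name: str
    file_name: str
    url: str


def make_row(scientific_name: str, url: str) -> ManifestRow:
    """Build a row, deriving the file name and a stable ID from the URL."""
    file_name = url.rstrip("/").rsplit("/", 1)[-1]
    # Assumption: a deterministic UUID derived from the URL serves as the ID,
    # so the same image always receives the same identifier.
    image_id = str(uuid.uuid5(uuid.NAMESPACE_URL, url))
    return ManifestRow(image_id, scientific_name, file_name, url)


def write_manifest(rows, handle) -> None:
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(ManifestRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


if __name__ == "__main__":
    # example.org is a placeholder host, not a real image source.
    rows = [make_row("Lepomis macrochirus", "https://example.org/images/fish_0001.jpg")]
    buffer = io.StringIO()
    write_manifest(rows, buffer)
    print(buffer.getvalue())
```

Deriving the ID from the URL keeps repeated runs reproducible; a sequential integer or a repository-assigned identifier would work equally well if that is what the dataset already uses.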

The workflow consists of three blocks:

  • Gathering/Harvesting
  • Cleaning/Filtering
  • Publishing


Gathering/Harvesting

IN PROGRESS...


Cleaning/Filtering

IN PROGRESS...


Publishing

IN PROGRESS...

Henry L. (Hank) Bart, Jr.
hbartjr@tulane.edu

Principal Investigator

Yasin Bakış
ybakis@tulane.edu

Co-Principal Investigator

Xiaojun Wang
xwang48@tulane.edu

Software Programmer