Cornell Bioacoustics Scientists Develop a High-Performance Computing Platform for Analyzing Big Data

“High-performance computing with MATLAB enables us to process previously unanalyzed big data. We translate what we learn into an understanding of how human activities affect the health of ecosystems to inform responsible decisions about what humans do in the ocean and on land.”

Challenge

Detect and classify animal sounds in huge sets of acoustic data acquired from oceans, fields, forests, and jungles

Solution

Develop a high-performance computing platform for acoustic data analysis using MATLAB, Parallel Computing Toolbox, and MATLAB Parallel Server

Results

  • Years of development time saved
  • Analysis time reduced from weeks to hours
  • Previously unprocessed data analyzed in days

An acoustic analysis device used by the Bioacoustics Research Program to collect data from large baleen whales and other marine mammals. Photo courtesy of Dimitri Ponirakis.

For more than 30 years, scientists have studied local animal populations by recording animal sounds in oceans, jungles, forests, and other natural environments. They use the results to assess the effect of man-made noise on natural environments, monitor endangered animal populations, and investigate animal communication. Passive acoustic monitoring systems record sounds continuously, generating terabytes of data. Scientists are often unable to process even 1% of this data because they lack the necessary advanced algorithms and processing capacity.

Bioacoustics Research Program (BRP) scientists at the Cornell Laboratory of Ornithology analyze vast amounts of acoustic data with MATLAB®, Parallel Computing Toolbox™, and MATLAB Parallel Server™. The project, funded by a grant from the Office of Naval Research and the National Oceanographic Partnership Program, is led by two principal investigators from Cornell: Dr. Christopher Clark, senior scientist and director of BRP, and Dr. Peter Dugan, lead data scientist for BRP.

“MATLAB and MATLAB parallel computing tools gave us the flexibility to dynamically improve and adapt the algorithms that we use to process our big acoustic data sets,” says Dr. Clark. “If we were using C++ or a similar language, we would not be able to move as quickly or explore as many scenarios.”

Challenge

Researchers analyzing acoustic data must contend with noise from weather, other animals, and nearby machinery and vehicles. The variability of animal sounds across individuals within a species is a further complication. These two factors—noise and variability—increase the number of false positives and negatives, reducing the detection algorithms’ accuracy.

Processing the hundreds of terabytes of data that BRP is gathering presents another challenge. A typical project involves processing years of raw acoustic data (up to 10 TB) recorded on multiple channels. Each channel may capture hundreds of millions of events, that is, sounds that stand out when the data is viewed as a spectrogram. Algorithms tested on small, high-quality samples are often considerably less accurate when applied to larger, noisier data sets.

Lastly, BRP analysis tools must serve a wide range of research initiatives, environments, and shifting requirements. “Answers to our initial research questions often lead to brand-new avenues to explore, and we need to be able to handle these sudden changes in our requirements,” says Dr. Clark.

Solution

BRP data scientists used MATLAB to develop high-performance computing (HPC) software for automatically processing acoustic data.

They begin a detection-classification project by collecting audio clips of the animal they wish to detect, clips of background noise in the animal’s environment, and MAT-files of archived acoustic data. Working in MATLAB, they develop new or refine existing algorithms that detect audio sequences in the archived data similar to those in the clip catalog.

The algorithms use pattern matching, edge detection, connected region analysis, convolution, and other techniques supported by Image Processing Toolbox™ and Signal Processing Toolbox™, as well as machine learning techniques supported by Fuzzy Logic Toolbox™ and Deep Learning Toolbox™.
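To make one of the techniques named above concrete, here is a minimal Python sketch of connected region analysis on a spectrogram. It is an illustration only, not BRP's MATLAB code: the function names `find_events` and `bounding_box`, the fixed energy threshold, the 4-connected neighbourhood, and the toy grid are all assumptions made for this example.

```python
from collections import deque


def find_events(spectrogram, threshold):
    """Group above-threshold cells of a 2-D grid into connected regions.

    spectrogram: list of rows (frequency bins), each a list of energies
                 (one value per time frame).
    Returns a list of regions; each region is a list of
    (freq_bin, time_frame) cells.
    """
    n_rows = len(spectrogram)
    n_cols = len(spectrogram[0]) if n_rows else 0
    seen = [[False] * n_cols for _ in range(n_rows)]
    regions = []
    for r in range(n_rows):
        for c in range(n_cols):
            if seen[r][c] or spectrogram[r][c] < threshold:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            region = []
            while queue:
                i, j = queue.popleft()
                region.append((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < n_rows and 0 <= nj < n_cols
                            and not seen[ni][nj]
                            and spectrogram[ni][nj] >= threshold):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            regions.append(region)
    return regions


def bounding_box(region):
    """Return (min_bin, max_bin, first_frame, last_frame) for one region."""
    rows = [i for i, _ in region]
    cols = [j for _, j in region]
    return (min(rows), max(rows), min(cols), max(cols))


# Toy spectrogram (made-up values): 3 frequency bins x 5 time frames.
toy = [
    [0.1, 0.2, 0.1, 0.1, 0.9],
    [0.1, 0.8, 0.9, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1, 0.1],
]
events = find_events(toy, threshold=0.5)
boxes = [bounding_box(e) for e in events]
```

Each region is a candidate event. In a fuller pipeline, its bounding box in time and frequency could be passed to a classifier, which is the stage where noise and call variability become the difficulty.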

To evaluate the accuracy of the algorithms, the researchers use Statistics and Machine Learning Toolbox™ to compute receiver operating characteristic (ROC) curves and other performance curves.
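As a rough illustration of what a ROC computation involves, the Python sketch below sweeps a decision threshold over detector scores and records the false-positive and true-positive rates at each step. It is not the toolbox routine the researchers used: the function names `roc_curve` and `auc`, the scores, and the labels are invented for this example.

```python
def roc_curve(scores, labels):
    """Return ROC points as (false_positive_rate, true_positive_rate).

    scores: detector confidence per candidate event (higher = more likely a call).
    labels: ground truth per candidate, 1 = real call, 0 = noise.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Handle tied scores together so they yield a single ROC point.
        score = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up scores and labels for six candidate events.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_curve(scores, labels)
area = auc(points)
```

An area near 1.0 means the detector ranks real calls above noise almost everywhere. An area near 0.5 means it does no better than chance.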

After debugging and optimizing the algorithms on small data sets using Parallel Computing Toolbox, the scientists run them against the full archived data sets on a 64-worker cluster using MATLAB Parallel Server.

The BRP team developed a MATLAB interface that enables researchers to specify the algorithms, data sets, and number of processors.
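The shape of such a job, where the user picks an algorithm, a data set, and a worker count, can be sketched in a few lines of Python. This is a hypothetical stand-in, not the BRP interface or the MATLAB Parallel Server API. The names `run_job` and `energy_detector` are invented, and threads in a single process take the place of cluster workers. For CPU-bound detectors, a process pool or a cluster scheduler would be needed to get real parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def energy_detector(chunk, threshold=0.5):
    """Hypothetical detector: indices of samples whose magnitude exceeds threshold."""
    return [i for i, x in enumerate(chunk) if abs(x) > threshold]


def run_job(detector, chunks, n_workers):
    """Apply `detector` to every chunk using a pool of `n_workers` workers.

    Chunks are independent, so they can be processed in any order;
    results are returned in the same order as `chunks`.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(detector, chunks))


# Three made-up chunks of audio samples.
chunks = [
    [0.0, 0.9, 0.1],
    [0.2, 0.1, 0.3],
    [-0.8, 0.0, 0.7],
]
detections = run_job(energy_detector, chunks, n_workers=2)
```

Because no chunk depends on another, the same detector can be debugged on a handful of chunks with a small pool and then run unchanged against a full archive with more workers.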

BRP collaborated with Marinexplore and the Kaggle community to sponsor a worldwide competition in which more than 240 participants submitted algorithms for detecting and classifying the upsweep contact calls of North Atlantic right whales. BRP used its MATLAB HPC platform to identify the most accurate algorithm, which will be used to help prevent ship collisions with the whales.

In addition to detection and classification algorithms, BRP uses MATLAB for noise analysis and acoustic modeling, in which the time and frequency dispersion effects of marine or terrestrial environments are captured and simulated.

Results

  • Years of development time saved. “A study of projected costs showed that if we had to do this on our own, it would take three years, $1 million, and a lot of outside help to develop the kind of HPC platform we needed,” says Dr. Dugan. “With Parallel Computing Toolbox and MATLAB Parallel Server, we developed the platform in under three months.”
  • Analysis time reduced from weeks to hours. “It took one of our algorithms 19 weeks to process 90 days of data,” says Dr. Dugan. “Using Parallel Computing Toolbox and MATLAB Parallel Server, we completed the same analysis on our cluster in 8 hours.”
  • Previously unprocessed data analyzed in days. “One data set captured 100,000 hours of sound. It was so large that we had previously processed less than 1% of it, estimating that it would take a year or more to process the rest,” says Dr. Dugan. “With our MATLAB HPC platform, we processed the data six times, using different detection algorithms, in two days.”
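The figures quoted above can be turned into throughput factors with simple arithmetic. The Python sketch below assumes that "19 weeks" and "two days" both refer to continuous wall-clock time (24 hours a day). The quotes do not state this, so the resulting factors are estimates.

```python
HOURS_PER_WEEK = 7 * 24

# "19 weeks to process 90 days of data" versus "8 hours" on the cluster.
serial_hours = 19 * HOURS_PER_WEEK       # 3192 hours, assuming a continuous run
cluster_hours = 8
speedup = serial_hours / cluster_hours   # 399.0

# "100,000 hours of sound ... processed six times ... in two days".
audio_hours = 100_000
passes = 6
wall_clock_hours = 2 * 24                # assumption: two full days of computing
real_time_factor = audio_hours * passes / wall_clock_hours  # 12500.0
```

Under these assumptions, the first analysis ran about 400 times faster, and the second processed audio at roughly 12,500 times real time. The 400-fold gain exceeds the 64 workers of the cluster described earlier, so it cannot come from parallelism alone if that cluster was used. The source does not break down where the remainder comes from.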

Acknowledgements

Cornell University is among the 1300 universities worldwide that provide campus-wide access to MATLAB and Simulink. With the Campus-Wide License, researchers, faculty, and students have access to a common configuration of products, at the latest release level, for use anywhere: in the classroom, at home, in the lab, or in the field.