Network with frequency based depth of information absorption

Timofey Uvarov
Dec 29, 2024


On January 4, 2022, patent 11215999, "DATA PIPELINE AND DEEP LEARNING SYSTEM FOR AUTONOMOUS DRIVING", was granted.

Here is a bit of background and some further thinking.

The image sensor inside an automotive camera operates in HDR mode, combining several exposures in real time. Depending on the scene, each exposure contains 8–12 bits of significant information, and after merging 3 or 4 exposures the resulting image usually carries around 16 bits of significant information.

Example of HDR process inside a sensor / ISP and GPU system:
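Since the figure is not reproduced here, the merge step can be illustrated with a minimal sketch. This is not the patented pipeline; `merge_exposures`, the 12-bit assumption, and the saturation threshold are all illustrative choices: each frame is scaled by its exposure time and near-saturated pixels are excluded, yielding a linear radiance estimate with roughly 16 bits of dynamic range.

```python
import numpy as np

def merge_exposures(exposures, exposure_times, bit_depth=12):
    """Merge several fixed-exposure RAW frames into one linear HDR image.
    Each frame is normalized by its exposure time; pixels near saturation
    are excluded from the weighted average (a simplified sensor-side merge)."""
    sat = (1 << bit_depth) - 1
    acc = np.zeros(exposures[0].shape, dtype=np.float64)
    wsum = np.zeros_like(acc)
    for frame, t in zip(exposures, exposure_times):
        valid = frame < sat * 0.95           # drop near-saturated pixels
        acc += np.where(valid, frame / t, 0.0)
        wsum += valid                        # count valid contributions
    return acc / np.maximum(wsum, 1)         # linear radiance estimate
```

For example, merging a short exposure and an 8x longer one extends the usable dynamic range: highlights clipped in the long frame are recovered from the short frame, while shadows keep the long frame's better SNR.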

The ML networks are usually trained and deployed using 8-bit hardware or GPU processing.

Do we need tone-mapping?

The process of quantizing imaging data is called tone-mapping, and it is usually done by decomposing the image into high- and low-frequency components. These components are passed through a gamma function or look-up tables, sometimes constructed dynamically from collected statistics.

Next, the high-pass and low-pass components are added back together, the signal is quantized to a target 8-bit word, and those words are fed into the network for training.
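The conventional pipeline just described can be sketched in a few lines. This is a deliberately crude illustration, not a production tone-mapper: the separable box blur stands in for a proper low-pass filter, and `tone_map_8bit` with its `radius` and `gamma` parameters is a hypothetical helper.

```python
import numpy as np

def box_blur(img, radius):
    """Separable box blur used here as a crude low-pass filter."""
    k = np.ones(2 * radius + 1) / (2 * radius + 1)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)

def tone_map_8bit(hdr_linear, radius=4, gamma=0.45):
    """Sketch of the conventional pipeline: split into low/high frequency,
    gamma-compress the low-pass base, re-add the detail, and quantize the
    sum into a single 8-bit word per pixel."""
    low = box_blur(hdr_linear, radius)
    high = hdr_linear - low                        # zero-centered detail
    low_tm = np.power(low / low.max(), gamma)      # gamma / LUT on the base
    out = np.clip(low_tm + high / hdr_linear.max(), 0.0, 1.0)
    return np.round(out * 255).astype(np.uint8)    # one 8-bit channel
```

The key point for what follows: after this summation and quantization, the high and low bands live in one value, so neither can be recovered exactly.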

What does such a high-pass/low-pass tree look like? (I will add actual filtering results here instead of the art link later.)

How to construct convolutional filters based on the stretch factor between connected layers?

During the tone-mapping process described above, a lot of information about the target object or scene is lost, which is a strong argument for training the network on RAW data from the image sensor instead.

But training and deploying a network that operates on 24/32-bit or floating-point data is computationally very expensive, and modern hardware accelerators are usually optimized for 8-bit compute.

Thus the motivation of this research was to restore the data path between the image sensor and the deep layers of the neural network: create the proposed visual-information data flow, control the loss of information, and use the existing fixed 8/16-bit computational platform for the onboard infrastructure.

The proposed architecture was also inspired by studying the behavior of and connections between human retinal cells such as photoreceptors, ganglion cells, and bipolar cells. Ganglion and bipolar cells perform high-pass and low-pass filtering and transfer the modulation of each frequency into the visual cortex. It is possible that ganglion cells also provide humans with a gaze-lock mechanism. To understand the function of ganglion cells, this video is helpful:

After analyzing the state-of-the-art method described above, the conclusion was that the main loss of information takes place when the high-pass and low-pass components are added together and then quantized. Once both frequency components are merged into a single value, it becomes impossible for the network to losslessly reconstruct either the low- or the high-frequency component of the original signal, because each of them has undergone a non-linear transformation.

Adding the two components after a gamma-like transformation may also significantly change the direction of the optical flow and destroy monotonic behavior, and it mixes many frequencies near the cut-off frequency.

Also, because of the summation, the output values often oversaturate, and the histogram becomes partitioned and distorted.

Thus it was proposed to send each frequency in a separate 8-bit channel. The bandwidth of each frequency depends on the kernel size of the receiving layer of the convolutional network.

The first layer of the convolutional network receives the highest frequency, which carries all the fine details and patterns. This signal is zero-centered, has a symmetric histogram after the transformation, and is very well suited for feature training of the first convolutional layer and for gradient descent.
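Per-band quantization of such a zero-centered signal can be sketched as follows. The helper `quantize_band_int8` and its per-band scale are my illustrative assumptions, not the patent's exact scheme; the point is that a band travelling in its own signed 8-bit channel keeps its symmetric histogram and can be reconstructed to within half a quantization step.

```python
import numpy as np

def quantize_band_int8(band):
    """Quantize a zero-centered band-pass signal into its own signed 8-bit
    channel. A per-band scale preserves the symmetric histogram, and the
    band reconstructs as q * scale with bounded error."""
    scale = max(float(np.max(np.abs(band))) / 127.0, 1e-12)
    q = np.clip(np.round(band / scale), -127, 127).astype(np.int8)
    return q, scale
```

Because the scale is chosen per band rather than shared with the low-pass base, no band is forced to compete with another for the same 8-bit range.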

Lower frequencies are consumed at deeper levels of the network. During network operation the image size shrinks with depth, and the goal is to filter out all frequencies higher than the new Nyquist threshold of the reduced image size, so that each band matches the corresponding layer of the network on the perception side.

Decomposition of signal

Such a scheme is applied recursively to the low-pass component.
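The recursive decomposition can be sketched as a pyramid, in the spirit of a Laplacian pyramid. This is a simplified stand-in for the proposed flow: `frequency_pyramid` and the box-blur low-pass are illustrative assumptions, and the 2x decimation models matching each band to the shrinking resolution of deeper network layers.

```python
import numpy as np

def box_blur(img, radius):
    """Separable box blur used as a crude low-pass filter."""
    k = np.ones(2 * radius + 1) / (2 * radius + 1)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)

def frequency_pyramid(img, levels=3):
    """Recursively split the signal: at each level keep the zero-centered
    band-pass detail as its own channel, then low-pass filter and decimate
    2x, removing frequencies above the new Nyquist limit before the next,
    deeper level."""
    bands = []
    low = img.astype(np.float64)
    for _ in range(levels):
        lp = box_blur(low, radius=2)
        bands.append(low - lp)      # band-pass channel for this network depth
        low = lp[::2, ::2]          # half resolution for the next level
    bands.append(low)               # residual low-pass component
    return bands
```

Each element of `bands` would feed the network at the depth whose feature-map resolution it matches, with the final low-pass residual entering deepest.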

The main benefit of this approach is that we can transmit the HDR signal losslessly (or with a controlled amount of compression for each frequency), improve the generalization of the learning process, and increase data integrity by learning in linear rather than tone-mapped/gamma space.

What's next?

Further on, I am going to describe how to choose the convolutional kernels to fuse with the network architecture and illustrate the connection to some well-known network architectures such as YOLO or Inception.

To read more about how the retinal layers in the human eye perform frequency-based decomposition of the signal, learn about ganglion cells:


Written by Timofey Uvarov

mathematician and system programmer
