A cnn will learn to recognize patterns across space while rnn is useful for solving temporal data problems One way to keep the capacity while reducing the receptive field size is to add 1x1 conv layers instead of 3x3 (i did so within the denseblocks, there the first layer is a 3x3 conv. Typically for a cnn architecture, in a single filter as described by your number_of_filters parameter, there is one 2d kernel per input channel
Foster Insurance Agency | Grand Rapids MI
$\begingroup$ but if you have separate cnn to extract features, you can extract features for last 5 frames and then pass these features to rnn
And then you do cnn part for.
Firstly when you say an object detection cnn, there are a huge number of model architectures available Considering that you have narrowed down on your model architecture a cnn will.