Detecting small objects

Why field of view matters


AI Okinawa

Kuba Kolodziejczyk



Profile
Name: Kuba Kolodziejczyk
From: Poland
Universities: University of London, Osaka University

Past
Nanyang Technological University
OIST
Lexues

Present
AI Okinawa - Representative
LiLz Inc. - CTO
University of the Ryukyus - Part-time lecturer


Single shot detector

SSD: Single Shot MultiBox Detector - sample detections


SSD: Single Shot MultiBox Detector - classification accuracy across object sizes on VOC dataset


SSD: Single Shot MultiBox Detector - recall across object sizes on COCO dataset




Convolutions

$$ I_j^{l+1} = f( \sum_{z} ( I_{j + z}^l * k_z ) + b ) $$

Inputs
\( I_0^1 \) \( I_1^1 \) \( I_2^1 \) \( I_3^1 \) \( I_4^1 \)
Kernel
\( k_0 \) \( k_1 \) \( k_2 \)


Outputs
\( I_0^2 \) \( I_1^2 \) \( I_2^2 \)

Where
\( I_0^2 = f( (I_0^1 * k_0) + (I_1^1 * k_1) + (I_2^1 * k_2) + b ) \)
\( I_1^2 = f( (I_1^1 * k_0) + (I_2^1 * k_1) + (I_3^1 * k_2) + b ) \)
\( I_2^2 = f( (I_2^1 * k_0) + (I_3^1 * k_1) + (I_4^1 * k_2) + b ) \)


Field of view
\( I_0^1 \sim I_2^1 \) \( I_1^1 \sim I_3^1 \) \( I_2^1 \sim I_4^1 \)
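The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming no padding, stride 1, and an identity activation by default; the function names `conv1d` and `field_of_view` and the sample values are made up for this example.

```python
def conv1d(inputs, kernel, bias=0.0, f=lambda x: x):
    """Valid 1D convolution (no padding, stride 1), following the slide formula."""
    n_out = len(inputs) - len(kernel) + 1
    return [
        f(sum(inputs[j + z] * kernel[z] for z in range(len(kernel))) + bias)
        for j in range(n_out)
    ]


def field_of_view(j, kernel_size):
    """Inclusive range of input indices that output j depends on."""
    return (j, j + kernel_size - 1)


# Illustrative values: five inputs and a three-element kernel, as on the slide.
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]

outputs = conv1d(inputs, kernel)
views = [field_of_view(j, len(kernel)) for j in range(len(outputs))]
print(outputs)  # [-2.0, -2.0, -2.0]
print(views)    # [(0, 2), (1, 3), (2, 4)]
```

Five inputs and a kernel of three give three outputs, each looking at three consecutive inputs.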




Convolutions - more layers
Layer 1
\( I_0^1 \) \( I_1^1 \) \( I_2^1 \) \( I_3^1 \) \( I_4^1 \)


Layer 2
\( I_0^1 \sim I_2^1 \) \( I_1^1 \sim I_3^1 \) \( I_2^1 \sim I_4^1 \)


Layer 3
\( I_0^1 \sim I_4^1 \) \( I_1^1 \sim I_5^1 \) \( I_2^1 \sim I_6^1 \)
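The growth of the field of view across stacked convolutions can be traced by composing input ranges layer by layer. The sketch below assumes kernel size 3, stride 1, and no padding; `compose_conv` is a hypothetical helper name. It uses seven inputs so that three layer 3 units exist.

```python
def compose_conv(ranges_prev, kernel_size=3):
    """Unit j of the next layer merges units j .. j+kernel_size-1 of the previous layer."""
    n_out = len(ranges_prev) - kernel_size + 1
    return [
        (ranges_prev[j][0], ranges_prev[j + kernel_size - 1][1])
        for j in range(n_out)
    ]


# Layer 1: every input sees only itself.
layer1 = [(i, i) for i in range(7)]
layer2 = compose_conv(layer1)  # after the first convolution
layer3 = compose_conv(layer2)  # after the second convolution

print(layer2[:3])  # [(0, 2), (1, 3), (2, 4)]
print(layer3)      # [(0, 4), (1, 5), (2, 6)]
```

Each extra 3-wide convolution widens the field of view by two inputs.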




Pooling

$$ I_j^{l+1} = max(I_{2j}^l, I_{2j+1}^l) $$

Inputs
\( I_0^1 \) \( I_1^1 \) \( I_2^1 \) \( I_3^1 \) \( I_4^1 \) \( I_5^1 \)


Outputs
\( I_0^2 \) \( I_1^2 \) \( I_2^2 \)
Where
\( I_0^2 = max(I_0^1, I_1^1) \)
\( I_1^2 = max(I_2^1, I_3^1) \)
\( I_2^2 = max(I_4^1, I_5^1) \)


Field of view
\( I_0^1 \sim I_1^1 \) \( I_2^1 \sim I_3^1 \) \( I_4^1 \sim I_5^1 \)
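As a minimal sketch of the pooling formula, assuming a window of 2 with stride 2 and made-up input values (`max_pool1d` is an illustrative name):

```python
def max_pool1d(inputs, size=2):
    """Non-overlapping max pooling: output j is the max of inputs size*j .. size*j+size-1."""
    return [
        max(inputs[i:i + size])
        for i in range(0, len(inputs) - size + 1, size)
    ]


inputs = [3, 1, 4, 1, 5, 9]
pooled = max_pool1d(inputs)
# Inclusive input range seen by each pooled output.
views = [(2 * j, 2 * j + 1) for j in range(len(pooled))]

print(pooled)  # [3, 4, 9]
print(views)   # [(0, 1), (2, 3), (4, 5)]
```

Unlike the convolution, neighbouring outputs here look at disjoint input windows, so the output is half the length of the input.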




Pooling - more layers
Layer 1
\( I_0^1 \) \( I_1^1 \) \( I_2^1 \) \( I_3^1 \) \( I_4^1 \) \( I_5^1 \)


Layer 2
\( I_0^1 \sim I_1^1 \) \( I_2^1 \sim I_3^1 \) \( I_4^1 \sim I_5^1 \)


Layer 3
\( I_0^1 \sim I_3^1 \) \( I_4^1 \sim I_7^1 \) \( I_8^1 \sim I_{11}^1 \)
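The same range-composition idea works for stacked pooling. The sketch below assumes 2-wide pooling with stride 2 and uses twelve inputs so that three layer 3 units exist; `compose_pool` is a hypothetical helper name.

```python
def compose_pool(ranges_prev, size=2):
    """Unit j of the next layer merges units size*j .. size*j+size-1 of the previous layer."""
    return [
        (ranges_prev[i][0], ranges_prev[i + size - 1][1])
        for i in range(0, len(ranges_prev) - size + 1, size)
    ]


layer1 = [(i, i) for i in range(12)]
layer2 = compose_pool(layer1)  # after the first pooling
layer3 = compose_pool(layer2)  # after the second pooling

print(layer2[:3])  # [(0, 1), (2, 3), (4, 5)]
print(layer3)      # [(0, 3), (4, 7), (8, 11)]
```

Each pooling layer doubles both the field of view and the distance between neighbouring windows.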




Two convolutional blocks
Convolutional block
convolution 3x3
convolution 3x3
convolution 3x3
pooling 2x2


Input
\( I_0^1 \) \( I_1^1 \) \( I_2^1 \) \( I_3^1 \) \( I_4^1 \) \( I_5^1 \)



Block 1 layer 1 - after first convolution
\( I_0^1 \sim I_2^1 \) \( I_1^1 \sim I_3^1 \) \( I_2^1 \sim I_4^1 \)


Block 1 layer 2 - after second convolution
\( I_0^1 \sim I_4^1 \) \( I_1^1 \sim I_5^1 \) \( I_2^1 \sim I_6^1 \)


Block 1 layer 3 - after third convolution
\( I_0^1 \sim I_6^1 \) \( I_1^1 \sim I_7^1 \) \( I_2^1 \sim I_8^1 \)


Block 1 layer 4 - after pooling
\( I_0^1 \sim I_7^1 \) \( I_2^1 \sim I_9^1 \) \( I_4^1 \sim I_{11}^1 \)



Block 2 layer 1 - after first convolution
\( I_0^1 \sim I_{11}^1 \) \( I_2^1 \sim I_{13}^1 \) \( I_4^1 \sim I_{15}^1 \)


Block 2 layer 2 - after second convolution
\( I_0^1 \sim I_{15}^1 \) \( I_2^1 \sim I_{17}^1 \) \( I_4^1 \sim I_{19}^1 \)


Block 2 layer 3 - after third convolution
\( I_0^1 \sim I_{19}^1 \) \( I_2^1 \sim I_{21}^1 \) \( I_4^1 \sim I_{23}^1 \)


Block 2 layer 4 - after pooling
\( I_0^1 \sim I_{21}^1 \) \( I_4^1 \sim I_{25}^1 \) \( I_8^1 \sim I_{29}^1 \)
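The ranges listed for the two blocks can be reproduced by tracking two numbers per layer: the field of view size and the step (in input pixels) between neighbouring units. This is a sketch under the slide's assumptions (three 3-wide convolutions with stride 1, then 2-wide pooling with stride 2, no padding); the function and variable names are illustrative.

```python
def fields_of_view(layers):
    """layers: list of (name, kernel_size, stride). Returns (name, size, step) per layer."""
    size, step = 1, 1
    result = []
    for name, kernel_size, stride in layers:
        size += (kernel_size - 1) * step  # the kernel spans units that are `step` inputs apart
        step *= stride                    # striding spreads the next layer's units further apart
        result.append((name, size, step))
    return result


block = [("conv", 3, 1), ("conv", 3, 1), ("conv", 3, 1), ("pool", 2, 2)]
result = fields_of_view(block * 2)
sizes = [size for _, size, _ in result]
print(sizes)  # [3, 5, 7, 8, 12, 16, 20, 22]

# Inclusive input ranges of the first three units after the second block.
_, size, step = result[-1]
ranges = [(u * step, u * step + size - 1) for u in range(3)]
print(ranges)  # [(0, 21), (4, 25), (8, 29)]
```

After pooling, every later convolution widens the field of view by more input pixels than before, which is why the second block grows it from 8 to 22 while the first only reaches 8.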




Single shot detector - object sizes vs fields of view

SSD: Single Shot MultiBox Detector - architecture

Fields of view after 5th convolutional block
\( I_0^1 \sim I_{212}^1 \) \( I_{33}^1 \sim I_{244}^1 \) \( I_{65}^1 \sim I_{276}^1 \)

    Pascal VOC object size analysis
  • max(width, height) > 100 px: 8.9%
  • max(width, height) < 50 px: 70.6%
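A size analysis of this kind can be sketched as follows. The box format `(x_min, y_min, x_max, y_max)` in pixels, the function name `size_shares`, and the sample boxes are all assumptions for illustration; they are not the data behind the percentages above.

```python
def size_shares(boxes, small=50, large=100):
    """Share of boxes whose longer side is below `small` or above `large` pixels."""
    longest = [max(x1 - x0, y1 - y0) for x0, y0, x1, y1 in boxes]
    n = len(longest)
    return {
        "small": sum(s < small for s in longest) / n,
        "large": sum(s > large for s in longest) / n,
    }


# Hypothetical bounding boxes with longer sides 30, 40, 120 and 80 px.
boxes = [(0, 0, 30, 20), (10, 10, 50, 45), (5, 5, 125, 65), (0, 0, 80, 60)]
shares = size_shares(boxes)
print(shares)  # {'small': 0.5, 'large': 0.25}
```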


213x213 window


213x213 window reduced to 1x1

Problem
Field of view too large to detect small objects




Solution
Add a prediction head at the block whose field of view is slightly larger than the target object sizes

Fields of view after 3rd convolutional block
\( I_0^1 \sim I_{75}^1 \) \( I_{8}^1 \sim I_{83}^1 \) \( I_{16}^1 \sim I_{91}^1 \)


(Top secret VOC-like dataset)


    Before
  • Precision: ~70%
  • Recall: ~20%


    After
  • Precision: ~80%
  • Recall: ~80%




Summary
  • Convolutional layers have associated fields of view
  • Convolutional layers compress data
    The larger the field of view, the less fine-grained information is retained
  • Small objects can be detected if you use layers with fields of view appropriate for the task
  • Aim for a field of view of 1.2x to 1.5x the object size, so the layer sees some context around the object
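The last rule of thumb can be sketched as a simple lookup. The field of view sizes below (76 px after block 3, 213 px after block 5) are read off the earlier slides; the function name `pick_blocks` and the example object sizes are illustrative assumptions.

```python
def pick_blocks(object_size, fields_of_view, low=1.2, high=1.5):
    """Names of blocks whose field of view is between low and high times the object size."""
    return [
        name
        for name, fov in fields_of_view.items()
        if low * object_size <= fov <= high * object_size
    ]


# Field of view in pixels per block, taken from the slides above.
fields_of_view = {"block 3": 76, "block 5": 213}

print(pick_blocks(60, fields_of_view))   # ['block 3']
print(pick_blocks(150, fields_of_view))  # ['block 5']
print(pick_blocks(20, fields_of_view))   # []
```

An empty result, as for the 20 px objects, suggests that none of the listed blocks fits and an earlier block is worth checking.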