video699.page.siamese

This module implements a page detector that matches feature vectors extracted from document page image data with feature vectors extracted from projection screen image data using a Siamese deep convolutional neural network. Related classes and functions are also implemented.

Module Contents

video699.page.siamese.LOGGER
video699.page.siamese.ALL_VIDEOS
video699.page.siamese.CONFIGURATION
video699.page.siamese.RESOURCES_PATHNAME
video699.page.siamese.TRAINING_SCREEN_DETECTOR
video699.page.siamese.VALIDATION_SCREEN_DETECTOR
video699.page.siamese.feature_tensor_l2_distances(screen_features, page_features)

The \(L_2\) distances between a pair of feature tensors.

Parameters:
  • screen_features (tensor) – A tensor \(\mathbf X\) of features \(\mathbf x\) extracted from a batch of screen images.
  • page_features (tensor) – A tensor \(\mathbf Y\) of features \(\mathbf y\) extracted from a batch of page images.
Returns:

l2_distances – A tensor of \(L_2\) distances \(\lVert\mathbf{x - y}\rVert_2\).

Return type:

tensor
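The following is a minimal NumPy sketch of the distance computation, not the module's implementation. It assumes that both batches have the same shape and that row i of the screen features is compared with row i of the page features. The helper name `l2_distances` is hypothetical.

```python
import numpy as np


def l2_distances(screen_features, page_features):
    """Row-wise L2 distances between two equally shaped feature batches.

    Assumption: row i of `screen_features` is paired with row i of
    `page_features`; the real function operates on framework tensors.
    """
    x = np.asarray(screen_features, dtype=float)
    y = np.asarray(page_features, dtype=float)
    # Sum the squared differences over the feature axis, then take the root.
    return np.sqrt(np.sum((x - y) ** 2, axis=-1))


X = np.array([[0.0, 0.0], [1.0, 1.0]])
Y = np.array([[3.0, 4.0], [1.0, 1.0]])
distances = l2_distances(X, Y)
```

Identical feature vectors produce a distance of zero, and the distance grows as the vectors move apart.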

class video699.page.siamese._ImageMoments(videos, screen_detector, documents)

Bases: object

Statistical moments extracted from lit projection screens and from document pages.

Parameters:
  • videos (iterable of VideoABC) – The videos in which we will detect lit projection screens.
  • screen_detector (ScreenDetectorABC) – The screen detector that will be used to detect lit projection screens in the video frames. We will extract statistical moments from the screens.
  • documents (iterable of DocumentABC) – The documents from whose pages we will extract statistical moments.
mean_screen

np.array – The mean preprocessed grayscale screen image.

inverse_screen_std

np.array – The inverse standard deviation of a preprocessed grayscale screen image.

mean_page

np.array – The mean preprocessed grayscale page image.

inverse_page_std

np.array – The inverse standard deviation of a preprocessed grayscale page image.
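One way such moments could be computed and applied is sketched below. It assumes per-pixel moments over a stack of equally sized grayscale images; the helper names `image_moments` and `normalize` and the `epsilon` guard are hypothetical and not part of the module.

```python
import numpy as np


def image_moments(images, epsilon=1e-8):
    """Per-pixel mean and inverse standard deviation of a set of images.

    Assumption: all images are grayscale arrays of the same shape.
    `epsilon` (hypothetical) avoids division by zero for constant pixels.
    """
    stack = np.stack([np.asarray(image, dtype=float) for image in images])
    mean = stack.mean(axis=0)
    inverse_std = 1.0 / (stack.std(axis=0) + epsilon)
    return mean, inverse_std


def normalize(image, mean, inverse_std):
    """Centers an image and scales it to unit variance per pixel."""
    return (np.asarray(image, dtype=float) - mean) * inverse_std


images = [np.zeros((2, 2)), np.full((2, 2), 2.0)]
mean, inverse_std = image_moments(images)
normalized = normalize(images[1], mean, inverse_std)
```

Storing the inverse standard deviation rather than the standard deviation turns normalization into a subtraction followed by a multiplication.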

class video699.page.siamese._AnnotatedImagePairs(videos, moments, screen_detector)

Bases: keras.utils.Sequence

Human-annotated pairs of screens and pages produced from a set of human-annotated videos.

A pair is produced for every combination of a screen in the set of human-annotated videos and a document page. Every matching pair of a screen and a page is assigned the classification label 0, whereas every non-matching pair is assigned the label 1. For every screen, the weights of the matching pairs and the weights of the non-matching pairs sum to the same value, which offsets the class imbalance that favors non-matching pairs.

Note

The choice of labels is significant. The sigmoid function predicts the label from the \(L_2\) distance between feature vectors. Close feature vectors will have a distance close to zero, which the sigmoid function will transform to the label 0. Distant feature vectors will have a large distance, which the sigmoid function will transform to the label 1.
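The labeling and weighting scheme for a single screen can be sketched as follows. The source only states that both classes have equal weight sums; splitting a total weight of 1 evenly between the two classes is an assumption, and the helper name `pair_labels_and_weights` is hypothetical.

```python
import numpy as np


def pair_labels_and_weights(matches):
    """Labels and sample weights for all pairs of one screen.

    `matches[i]` is True when page i matches the screen. Matching pairs
    get the label 0, non-matching pairs the label 1. Each class receives
    a total weight of 0.5 (an assumed normalization).
    """
    matches = np.asarray(matches, dtype=bool)
    labels = np.where(matches, 0, 1)
    weights = np.zeros(len(matches), dtype=float)
    num_matching = int(matches.sum())
    num_non_matching = len(matches) - num_matching
    if num_matching:
        weights[matches] = 0.5 / num_matching
    if num_non_matching:
        weights[~matches] = 0.5 / num_non_matching
    return labels, weights


# One matching page and three non-matching pages for a single screen.
labels, weights = pair_labels_and_weights([True, False, False, False])
```

In this example the single matching pair carries as much weight as the three non-matching pairs combined.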

Parameters:
  • videos (set of AnnotatedSampledVideo) – A set of human-annotated videos that will be used to produce pairs of projection screens and document pages.
  • moments (_ImageMoments) – Statistical moments used to normalize image data.
  • screen_detector (ScreenDetectorABC) – A screen detector that will be used to detect screens in the videos.
shuffle(self)

Shuffles the pairs of screens and pages.

__len__(self)
__getitem__(self, idx)
class video699.page.siamese._KerasSiameseNeuralNetwork(training_videos=None, make_persistent=True)

Bases: object

A Siamese convolutional neural network trained on a set of human-annotated videos.

Notes

All human-annotated videos that are not part of the training set will be used as the validation set. The validation loss and accuracy will be recorded, but they will not be used to influence the training. Therefore, the validation set can still be used as the test set in subsequent evaluation.

Parameters:
  • training_videos (set of AnnotatedSampledVideo or None, optional) – The human-annotated videos that will be used to train the Siamese deep convolutional neural network. If None or unspecified, all human-annotated videos will be used as the training set.
  • make_persistent (bool, optional) – Whether the neural network will be persistently stored for future reuse. When False, the neural network will not be stored, but an existing stored neural network will nevertheless be loaded.
regression_model

keras.models.Model – A deep convolutional neural network trained on the training set of human-annotated videos. Given a screen image or a page image, the neural network extracts deep image features.

thresholding_model

keras.models.Model – A Siamese deep convolutional neural network trained on the training set of human-annotated videos. Given an \(L_2\) distance between screen image features and page image features, the neural network predicts a class label.

training_moments

_ImageMoments – Statistical moments extracted from the training videos.

training_history

dict – The history attribute of the keras.callbacks.History object produced during the training.

get_screen_features(self, screens)

Extracts deep features from projection screen images.

Parameters:

screens (iterable of ScreenABC) – Projection screens from which we will extract deep image features.

Yields:
  • screen (ScreenABC) – A projection screen.
  • screen_features (np.array) – The deep features extracted from the projection screen image.
get_page_features(self, pages)

Extracts deep features from document page images.

Parameters:

pages (iterable of PageABC) – Document pages from which we will extract deep image features.

Yields:
  • page (PageABC) – A document page.
  • page_features (np.array) – The deep features extracted from the document page image.
threshold_distances(self, distances)

Predicts whether projection screens and document pages match.

Parameters:

distances (array_like) – The \(L_2\) distances between projection screen features and document page features.

Returns:

thresholded_distances – Whether the projection screens and document pages match.

Return type:

array_like

video699.page.siamese._preprocess_image(image)

Preprocesses an image to be used as an input to a Siamese deep convolutional neural network.

Parameters:

image (ImageABC) – An image to be preprocessed.

Returns:

preprocessed_image – The preprocessed grayscale image to be used as an input to a Siamese deep convolutional neural network.

Return type:

np.array

class video699.page.siamese.KerasSiamesePageDetector(documents, training_videos=None)

Bases: video699.interface.PageDetectorABC

A page detector using approximate nearest neighbor search of deep image features.

A convolutional neural network takes images as input and produces feature vectors as output. During training, two “Siamese” copies of the convolutional neural network with shared weights are produced and topped with a lambda layer that computes the \(L_2\) distance \(d\) between the two output feature vectors, followed by a dense one-unit layer with the sigmoid activation function \(S\). Pairs of screen and page images are fed to the Siamese network, and the binary cross-entropy loss function \(-(y\cdot\log(S(d)) + (1 - y)\cdot\log(1 - S(d)))\) is used to evaluate how well the network classifies matching (\(y = 0\)) and non-matching (\(y = 1\)) image pairs. This general Siamese architecture was first proposed by [BromleyEtAl94].
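The loss can be illustrated numerically with the sketch below. It assumes that the dense one-unit layer applies a learned scalar weight and bias to the distance before the sigmoid; the values of `weight` and `bias` used here are hypothetical, not trained parameters.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y, d, weight=1.0, bias=0.0):
    """Binary cross-entropy of label y given the L2 distance d.

    `weight` and `bias` stand in for the parameters of the dense
    one-unit layer (hypothetical values, not taken from the module).
    """
    s = sigmoid(weight * d + bias)
    return -(y * np.log(s) + (1 - y) * np.log(1 - s))


# With a negative bias, a small distance maps to a prediction near 0.
near_match = binary_cross_entropy(y=0, d=0.5, weight=2.0, bias=-4.0)
near_non_match = binary_cross_entropy(y=1, d=0.5, weight=2.0, bias=-4.0)
far_match = binary_cross_entropy(y=0, d=5.0, weight=2.0, bias=-4.0)
far_non_match = binary_cross_entropy(y=1, d=5.0, weight=2.0, bias=-4.0)
```

Under these assumed parameters, a small distance is cheap when the pair is labeled as matching and expensive when it is labeled as non-matching, and the opposite holds for a large distance.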

Deep image features are extracted from the image data of the provided document pages using the trained convolutional neural network, and they are placed inside a vector database. Deep image features are then extracted from the image data in a screen and nearest neighbors are retrieved from the vector database. The page images corresponding to the nearest neighbors are paired with the screen image and fed to the Siamese network. The document page with the nearest features that is predicted to match the screen is detected as the page shown in the screen. If none of the pages corresponding to the nearest neighbors is predicted to match the screen by the Siamese neural network, then no page is detected in the screen.
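The retrieval and verification steps described above can be sketched as follows. This is a simplification under stated assumptions: an exact brute-force search stands in for the approximate vector database, and a plain distance threshold stands in for the prediction of the Siamese network. The names `detect_page`, `predicts_match`, and `k` are hypothetical.

```python
import numpy as np


def detect_page(screen_features, page_features, predicts_match, k=3):
    """Index of the nearest page predicted to match the screen, or None.

    `page_features` holds one feature vector per page. `predicts_match`
    is a stand-in for the Siamese network's decision on a distance.
    """
    distances = np.linalg.norm(page_features - screen_features, axis=1)
    # Visit the k nearest pages from the closest to the farthest.
    for index in np.argsort(distances)[:k]:
        if predicts_match(distances[index]):
            return int(index)
    return None  # No candidate page was predicted to match.


pages = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])


def is_match(distance):
    return distance < 0.5


detected = detect_page(np.array([0.9, 0.1]), pages, is_match)
undetected = detect_page(np.array([3.0, 3.0]), pages, is_match)
```

The first screen lies close to the second page, so that page is detected. The second screen is far from every page, so no page is detected.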

[BromleyEtAl94] Bromley, Jane, et al. “Signature Verification using a ‘Siamese’ Time Delay Neural Network.” Advances in Neural Information Processing Systems. 1994.
Parameters:
  • documents (set of DocumentABC) – The documents whose pages will be detected in projection screens.
  • training_videos (set of AnnotatedSampledVideo or None, optional) – The human-annotated videos that will be used to train the Siamese deep convolutional neural network. When None or unspecified, all human-annotated videos will be used.
detect(self, frame, appeared_screens, existing_screens, disappeared_screens)