video699.page.siamese¶
This module implements a page detector that matches feature vectors extracted from document page image data with feature vectors extracted from projection screen image data using a Siamese deep convolutional neural network. Related classes and functions are also implemented.
Module Contents¶
-
video699.page.siamese.LOGGER¶
-
video699.page.siamese.ALL_VIDEOS¶
-
video699.page.siamese.CONFIGURATION¶
-
video699.page.siamese.RESOURCES_PATHNAME¶
-
video699.page.siamese.TRAINING_SCREEN_DETECTOR¶
-
video699.page.siamese.VALIDATION_SCREEN_DETECTOR¶
-
video699.page.siamese.feature_tensor_l2_distances(screen_features, page_features)¶ The \(L_2\) distances between a pair of feature tensors.
Parameters: - screen_features (tensor) – A tensor \(\mathbf X\) of features \(\mathbf x\) extracted from a batch of screen images.
- page_features (tensor) – A tensor \(\mathbf Y\) of features \(\mathbf y\) extracted from a batch of page images.
Returns: l2_distances – A tensor of \(L_2\) distances \(\lVert\mathbf{x - y}\rVert_2\).
Return type: tensor
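As an illustration of the quantity described above, the following is a minimal NumPy sketch that computes one \(L_2\) distance per aligned pair of rows. It assumes that both tensors have the same shape and that distances are taken row by row; the name `rowwise_l2_distances` is hypothetical and this is not the module's own tensor implementation.

```python
import numpy as np


def rowwise_l2_distances(screen_features, page_features):
    """Return ||x - y||_2 for every aligned pair of rows (x, y).

    Hypothetical helper for illustration only; assumes both inputs
    have the same shape (batch_size, num_features).
    """
    differences = (
        np.asarray(screen_features, dtype=float)
        - np.asarray(page_features, dtype=float)
    )
    # Sum the squared differences over the feature axis, then take the root.
    return np.sqrt(np.sum(differences ** 2, axis=-1))


X = np.array([[0.0, 0.0], [1.0, 2.0]])  # features of two screens
Y = np.array([[3.0, 4.0], [1.0, 2.0]])  # features of two pages
distances = rowwise_l2_distances(X, Y)
```

Identical feature vectors yield a distance of zero, which is what the matching label 0 relies on.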
-
class
video699.page.siamese._ImageMoments(videos, screen_detector, documents)¶ Bases:
object
Statistical moments extracted from lit projection screens and from document pages.
Parameters: - videos (iterable of VideoABC) – The videos in which we will detect lit projection screens.
- screen_detector (ScreenDetectorABC) – The screen detector that will be used to detect lit projection screens in the video frames. We will extract statistical moments from the screens.
- documents (iterable of DocumentABC) – The documents from whose pages we will extract statistical moments.
-
mean_screen¶ np.array – The mean preprocessed grayscale screen image.
-
inverse_screen_std¶ np.array – The inverse standard deviation of a preprocessed grayscale screen image.
-
mean_page¶ np.array – The mean preprocessed grayscale page image.
-
inverse_page_std¶ np.array – The inverse standard deviation of a preprocessed grayscale page image.
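The sketch below shows one way such moments could be computed and applied to normalize an image. It assumes per-pixel moments over a stack of equally sized grayscale images and a small `epsilon` to avoid division by zero; the function names and the `epsilon` parameter are hypothetical and are not taken from the module.

```python
import numpy as np


def compute_moments(images, epsilon=1e-8):
    """Return the per-pixel mean and inverse standard deviation.

    Hypothetical helper; `epsilon` is an assumed safeguard against
    pixels whose standard deviation is zero.
    """
    stack = np.stack(images).astype(float)
    mean = stack.mean(axis=0)
    inverse_std = 1.0 / (stack.std(axis=0) + epsilon)
    return mean, inverse_std


def normalize(image, mean, inverse_std):
    """Center the image and scale it to unit variance per pixel."""
    return (image - mean) * inverse_std


images = [np.array([[0.0, 2.0]]), np.array([[4.0, 2.0]])]
mean, inverse_std = compute_moments(images)
normalized = normalize(images[0], mean, inverse_std)
```

Storing the inverse standard deviation lets normalization use a multiplication instead of a division for every image.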
-
class
video699.page.siamese._AnnotatedImagePairs(videos, moments, screen_detector)¶ Bases:
keras.utils.Sequence
Human-annotated pairs of screens and pages produced from a set of human-annotated videos.
For every combination of a screen in the set of human-annotated videos and a document page, a pair is produced. Every matching pair of a screen and a page is assigned a classification label of 0, whereas non-matching pairs are assigned the label of 1. For every screen, the non-matching pairs and the matching pairs have an equal sum of their weights to offset the class imbalance that favors non-matching pairs.
Note
The choice of labels is significant. The sigmoid function predicts the label from the \(L_2\) distance between feature vectors. Close feature vectors will have a distance close to zero, which the sigmoid function will transform to the label 0. Distant feature vectors will have a large distance, which the sigmoid function will transform to the label 1.
Parameters: - videos (set of AnnotatedSampledVideo) – A set of human-annotated videos that will be used to produce pairs of projection screens and document pages.
- moments (_ImageMoments) – Statistical moments used to normalize image data.
- screen_detector (ScreenDetectorABC) – A screen detector that will be used to detect screens in the videos.
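The class-balancing rule described above can be sketched as follows. The example assumes that each class receives a total weight of 0.5 per screen, split evenly among its pairs; the source only states that the two sums are equal, so the constant and the name `balanced_pair_weights` are assumptions.

```python
def balanced_pair_weights(labels):
    """Return one weight per pair for all pairs of a single screen.

    `labels` holds 0 for matching pairs and 1 for non-matching pairs.
    Hypothetical helper; the total weight of 0.5 per class is assumed.
    """
    num_matching = labels.count(0)
    num_nonmatching = labels.count(1)
    weights = []
    for label in labels:
        # Every pair shares its class's total weight equally.
        count = num_matching if label == 0 else num_nonmatching
        weights.append(0.5 / count)
    return weights


labels = [0, 1, 1, 1, 1]  # one matching page, four non-matching pages
weights = balanced_pair_weights(labels)
matching_sum = sum(w for w, l in zip(weights, labels) if l == 0)
nonmatching_sum = sum(w for w, l in zip(weights, labels) if l == 1)
```

With one matching and four non-matching pairs, the matching pair carries four times the weight of each non-matching pair, so both classes contribute equally to the loss.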
-
shuffle(self)¶ Shuffles the pairs of screens and pages.
-
__len__(self)¶
-
__getitem__(self, idx)¶
-
class
video699.page.siamese._KerasSiameseNeuralNetwork(training_videos=None, make_persistent=True)¶ Bases:
object
A Siamese convolutional neural network trained on a set of human-annotated videos.
Notes
All human-annotated videos that are not part of the training set will be used as the validation set. The validation loss and accuracy will be recorded, but they will not be used to influence the training. Therefore, the validation set can still be used as the test set in subsequent evaluation.
Parameters: - training_videos (set of AnnotatedSampledVideo or None, optional) – The human-annotated videos that will be used to train the Siamese deep convolutional neural network. If None or unspecified, all human-annotated videos will be used as the training set.
- make_persistent (bool, optional) – Whether the neural network will be persistently stored for future reuse. When unspecified or False, a neural network will not be stored, but an existing neural network will nevertheless be loaded.
-
regression_model¶ keras.models.Model – A deep convolutional neural network trained on the training set of human-annotated videos. Given a screen image or a page image, the neural network extracts deep image features.
-
thresholding_model¶ keras.models.Model – A Siamese deep convolutional neural network trained on the training set of human-annotated videos. Given an \(L_2\) distance between screen image features and page image features, the neural network predicts a class label.
-
training_moments¶ _ImageMoments – Statistical moments extracted from the training videos.
-
training_history¶ dict – The history attribute of the keras.callbacks.History object produced during the training.
-
get_screen_features(self, screens)¶ Extracts deep features from projection screen images.
Parameters: screens (iterable of ScreenABC) – Projection screens from which we will extract deep image features.
Yields: - screen (ScreenABC) – A projection screen.
- screen_features (np.array) – The deep features extracted from the projection screen image.
-
get_page_features(self, pages)¶ Extracts deep features from document page images.
Parameters: pages (iterable of PageABC) – Document pages from which we will extract deep image features.
Yields: - page (PageABC) – A document page.
- page_features (np.array) – The deep features extracted from the document page image.
-
threshold_distances(self, distances)¶ Predicts whether projection screens and document pages match.
Parameters: distances (array_like) – The \(L_2\) distances between projection screen features and document page features.
Returns: thresholded_distances – Whether the projection screens and document pages match.
Return type: array_like
-
video699.page.siamese._preprocess_image(image)¶ Preprocesses an image to be used as an input to a Siamese deep convolutional neural network.
Parameters: image (ImageABC) – An image to be preprocessed.
Returns: preprocessed_image – The preprocessed grayscale image to be used as an input to a Siamese deep convolutional neural network.
Return type: np.array
-
class
video699.page.siamese.KerasSiamesePageDetector(documents, training_videos=None)¶ Bases:
video699.interface.PageDetectorABC
A page detector using approximate nearest neighbor search of deep image features.
A convolutional neural network accepts images on the input and produces feature vectors on the output. During training, two “Siamese” copies of the convolutional neural network with shared weights are produced and topped with a lambda layer that computes the \(L_2\) distance \(d\) between the two output feature vectors, and with a dense one-unit layer with the sigmoid activation function \(S\). Pairs of screen and page images are fed to the Siamese network, and the binary cross-entropy loss function \(-(y\cdot\log(S(d)) + (1 - y)\cdot\log(1 - S(d)))\) is used to evaluate how well the network classifies matching (\(y = 0\)), and non-matching (\(y = 1\)) image pairs. This general Siamese architecture was first proposed by [BromleyEtAl94].
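To make the loss above concrete, the following sketch evaluates the binary cross-entropy for a single pair. It assumes that the dense one-unit layer computes \(S(w\cdot d + b)\) with a scalar weight and bias; the values of `weight` and `bias` below are made up for illustration, and the clipping constant `epsilon` is an added numerical safeguard rather than part of the documented formula.

```python
import math


def sigmoid(z):
    """The logistic sigmoid function S."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y, distance, weight, bias, epsilon=1e-12):
    """Loss for one pair with label y (0 = match, 1 = non-match).

    `weight` and `bias` stand for the parameters of the dense one-unit
    layer; they are hypothetical values, not trained ones.
    """
    prediction = sigmoid(weight * distance + bias)
    # Clip the prediction so that neither logarithm receives zero.
    prediction = min(max(prediction, epsilon), 1.0 - epsilon)
    return -(y * math.log(prediction) + (1 - y) * math.log(1.0 - prediction))


# A small distance should be cheap to label as a match (y = 0)
# and expensive to label as a non-match (y = 1).
loss_if_match = binary_cross_entropy(0, 0.5, weight=2.0, bias=-4.0)
loss_if_nonmatch = binary_cross_entropy(1, 0.5, weight=2.0, bias=-4.0)
```

A positive weight with a negative bias places the decision boundary at a positive distance, so pairs closer than that boundary are pushed towards the label 0.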
Deep image features are extracted from the image data of the provided document pages using the trained convolutional neural network, and they are placed inside a vector database. Deep image features are then extracted from the image data in a screen and nearest neighbors are retrieved from the vector database. The page images corresponding to the nearest neighbors are paired with the screen image and fed to the Siamese network. The document page with the nearest features that is predicted to match the screen is detected as the page shown in the screen. If none of the pages corresponding to the nearest neighbors is predicted to match the screen by the Siamese neural network, then no page is detected in the screen.
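The retrieval procedure above can be sketched with a brute-force nearest neighbor search in NumPy. This is a simplification under stated assumptions: an exhaustive search stands in for the approximate vector database, and the callable `predicts_match` is a hypothetical stand-in for the trained Siamese network that here merely thresholds the distance.

```python
import numpy as np


def detect_page(screen_features, page_features, predicts_match, num_neighbors=3):
    """Return the index of the detected page, or None if no page matches.

    Hypothetical helper; `predicts_match` replaces the Siamese network
    and `num_neighbors` is an assumed number of retrieved candidates.
    """
    # Distance from the screen to every page in the (brute-force) database.
    distances = np.linalg.norm(page_features - screen_features, axis=1)
    # Visit candidate pages from the nearest to the farthest.
    for index in np.argsort(distances)[:num_neighbors]:
        if predicts_match(distances[index]):
            return int(index)
    return None


page_features = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])


def predicts_match(distance):
    """Stand-in classifier with a made-up threshold."""
    return distance < 0.5


detected = detect_page(np.array([0.9, 1.1]), page_features, predicts_match)
undetected = detect_page(np.array([3.0, 3.0]), page_features, predicts_match)
```

Visiting candidates in order of increasing distance ensures that the nearest page predicted to match is the one returned, and `None` signals that no page was detected in the screen.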
[BromleyEtAl94] Bromley, Jane, et al. “Signature Verification using a ‘Siamese’ Time Delay Neural Network.” Advances in Neural Information Processing Systems. 1994.
Parameters: - documents (set of DocumentABC) – The provided document pages.
- training_videos (set of AnnotatedSampledVideo or None, optional) – The human-annotated videos that will be used to train the Siamese deep convolutional neural network. When None or unspecified, all human-annotated videos will be used.
-
detect(self, frame, appeared_screens, existing_screens, disappeared_screens)¶