Author: justinshenk

Description:
Find duplicates and similar images in a folder
Language: Python
Repository: git://github.com/justinshenk/simages.git
Created: 2019-05-22T14:09:39Z
Project page: https://github.com/justinshenk/simages

License: MIT License


:monkey: simages :monkey:

Find similar images within a dataset.

Useful for removing duplicate images from a dataset after scraping images with google-images-download.

The Python API returns pairs, distances, where pairs are the (ordered) closest pairs and distances are the
corresponding embedding distances.

Install

See the installation docs for all details.

    pip install simages

or install from source:

    git clone https://github.com/justinshenk/simages
    cd simages
    pip install .

To install the interactive interface, install mongodb and use pip install "simages[all]" instead.

Demo

  1. Minimal command-line interface with simages-show:

simages_demo

  2. Interactive image deletion with simages add/find:

simages_web_demo

Usage

Two interfaces exist:

  1. minimal interface which plots the duplicates for visual inspection
  2. mongodb + flask interface which allows interactive deletion [optional]

Minimal Interface

In your console, enter the directory with images and use simages-show:

    $ simages-show --data-dir .

    usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
                        [--epochs EPOCHS] [--num-channels NUM_CHANNELS]
                        [--pairs PAIRS] [--zdim ZDIM] [-s]

      -h, --help            show this help message and exit
      --data-dir DATA_DIR, -d DATA_DIR
                            Folder containing image data
      --show-train, -t      Show training of embedding extractor every epoch
      --epochs EPOCHS, -e EPOCHS
                            Number of passes of dataset through model for
                            training. More is better but takes more time.
      --num-channels NUM_CHANNELS, -c NUM_CHANNELS
                            Number of channels for data (1 for grayscale, 3 for
                            color)
      --pairs PAIRS, -p PAIRS
                            Number of pairs of images to show
      --zdim ZDIM, -z ZDIM  Compression bits (bigger generally performs better but
                            takes more time)
      -s, --show            Show closest pairs
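
To script the minimal interface, for instance from a batch job, the flags listed above can be assembled programmatically. The following is a minimal sketch and not part of simages: build_show_command is a hypothetical helper, and it only assumes that the simages-show executable is on your PATH and accepts the flags shown in the help text.

```python
import shlex
from typing import List, Optional


def build_show_command(
    data_dir: str,
    pairs: Optional[int] = None,
    epochs: Optional[int] = None,
    zdim: Optional[int] = None,
    show: bool = True,
) -> List[str]:
    """Assemble a simages-show argument list (hypothetical helper).

    Only flags from the help text are used. Options left as None are
    omitted so that the tool falls back to its own defaults.
    """
    cmd = ["simages-show", "--data-dir", data_dir]
    if pairs is not None:
        cmd += ["--pairs", str(pairs)]
    if epochs is not None:
        cmd += ["--epochs", str(epochs)]
    if zdim is not None:
        cmd += ["--zdim", str(zdim)]
    if show:
        cmd.append("--show")
    return cmd


if __name__ == "__main__":
    # Print the command line instead of running it, so this sketch
    # works even when simages is not installed.
    print(shlex.join(build_show_command(".", pairs=10, epochs=5)))
```

The returned list can be passed to subprocess.run(command, check=True) once simages is installed.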

Web Interface [Optional]

Note: To install the web interface API, install and run mongodb and use pip install "simages[all]" to install optional dependencies.

Add your pictures to the database (this will take some time depending on the number of pictures)

    simages add <images_folder_path>

A webpage will come up with all of the similar or duplicate pictures:

    simages find <images_folder_path>

    Usage:
        simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
        simages remove <path> ... [--db=<db_path>]
        simages clear [--db=<db_path>]
        simages show [--db=<db_path>]
        simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
        simages -h | --help

    Options:
        -h, --help                  Show this screen
        --db=<db_path>              The location of the database or a MongoDB URI. (default: ./db)
        --parallel=<num_processes>  The number of parallel processes to run to hash the image
                                    files (default: number of CPUs).

        find:
            --print                 Only print duplicate files rather than displaying HTML file
            --delete                Move all found duplicate pictures to the trash. This option takes priority over --print.
            --match-time            Adds the extra constraint that duplicate images must have the
                                    same capture times in order to be considered.
            --trash=<trash_path>    Where files will be put when they are deleted (default: ./Trash)
            --epochs=<epochs>       Epochs for training [default: 2]

Python APIs

Numpy array

    from simages import find_duplicates
    import numpy as np

    # np.random.random takes the shape as a single tuple
    array_data = np.random.random((100, 3, 48, 48))  # N x C x H x W
    pairs, distances = find_duplicates(array_data)

Folder

    from simages import find_duplicates

    data_dir = "my_images_folder"
    pairs, distances = find_duplicates(data_dir)

Default options for find_duplicates are:

    def find_duplicates(
        input: Union[str, np.ndarray],
        n: int = 5,
        num_epochs: int = 2,
        num_channels: int = 3,
        show: bool = False,
        show_train: bool = False,
        **kwargs
    ):
        """Find duplicates in dataset. `input` is either a folder path or an array.

        Args:
            input (str or np.ndarray): folder directory or N x C x H x W array
            n (int): number of closest pairs to identify
            num_epochs (int): how long to train the autoencoder (more is generally better)
            show (bool): display the closest pairs
            show_train (bool): show output every epoch
            z_dim (int): size of compression (more is generally better, but slower)
            kwargs (dict): etc, passed to `EmbeddingExtractor`

        Returns:
            pairs (np.ndarray): indices for closest pairs of images, n x 2 array
            distances (np.ndarray): distances of each pair to each other
        """
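
Because find_duplicates returns the n closest pairs, some of them may be merely similar rather than true duplicates. A minimal sketch of a post-filter follows. Both pairs_below and the threshold value are assumptions for illustration and not part of simages; a sensible cutoff depends on your data and on z_dim.

```python
from typing import List, Sequence, Tuple


def pairs_below(
    pairs: Sequence[Sequence[int]],
    distances: Sequence[float],
    threshold: float,
) -> List[Tuple[int, int, float]]:
    """Keep only pairs whose embedding distance is under `threshold`.

    Works on plain lists as well as on the n x 2 and length-n arrays
    that find_duplicates is documented to return.
    """
    if len(pairs) != len(distances):
        raise ValueError("pairs and distances must have the same length")
    kept = []
    for (i, j), d in zip(pairs, distances):
        if d < threshold:
            kept.append((int(i), int(j), float(d)))
    return kept


# Example values, shaped like the documented return values.
example_pairs = [[912, 990], [716, 790], [907, 943]]
example_distances = [0.0015, 0.0400, 0.0016]

# The 0.01 cutoff is arbitrary and only for illustration.
likely_duplicates = pairs_below(example_pairs, example_distances, threshold=0.01)
print(likely_duplicates)  # [(912, 990, 0.0015), (907, 943, 0.0016)]
```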

Embeddings API

    from simages import Embeddings
    import numpy as np

    N = 1000
    data = np.random.random((N, 28, 28))
    embeddings = Embeddings(data)

    # Access the array
    array = embeddings.array  # N x z (compression size)

    # Get 5 closest pairs of images
    pairs, distances = embeddings.duplicates(n=5)

    In [0]: pairs
    Out[0]: array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])

    In [1]: distances
    Out[1]: array([0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721])
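
When an image appears in several pairs, it can help to merge the pairs into groups before deciding what to delete. The following sketch does this with a small union-find. group_pairs is a hypothetical helper, not part of simages, and it treats every returned pair as a true duplicate, which may not hold for the more distant pairs.

```python
from typing import Dict, Iterable, List, Sequence


def group_pairs(pairs: Iterable[Sequence[int]]) -> List[List[int]]:
    """Merge index pairs into sorted groups of connected indices."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(int(a)), find(int(b))
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[int, List[int]] = {}
    for index in parent:
        groups.setdefault(find(index), []).append(index)
    return sorted(sorted(members) for members in groups.values())


# Images 1, 2 and 3 are linked through two pairs; 7 and 8 form their own group.
print(group_pairs([[1, 2], [2, 3], [7, 8]]))  # [[1, 2, 3], [7, 8]]
```

One possible policy is then to keep the first index of each group and review the rest by eye before removing anything.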

EmbeddingExtractor API

    from simages import EmbeddingExtractor
    import numpy as np

    N = 1000
    data = np.random.random((N, 28, 28))
    extractor = EmbeddingExtractor(data, num_channels=1)  # grayscale

    # Show 10 closest pairs of images
    pairs, distances = extractor.show_duplicates(n=10)

Class attributes and parameters:

    class EmbeddingExtractor:
        """Extract embeddings from data with models and allow visualization.

        Attributes:
            trainloader (torch loader)
            evalloader (torch loader)
            model (torch.nn.Module)
            embeddings (np.ndarray)
        """

        def __init__(
            self,
            input: Union[str, np.ndarray],
            num_channels=None,
            num_epochs=2,
            batch_size=32,
            show_train=True,
            show=False,
            z_dim=8,
            **kwargs,
        ):
            """Inits EmbeddingExtractor with input, either `str` or `np.ndarray`, performs training and validation.

            Args:
                input (np.ndarray or str): data
                num_channels (int): grayscale = 1, color = 3
                num_epochs (int): more is better (generally)
                batch_size (int): number of images per batch
                show_train (bool): show intermediate training results
                show (bool): show closest pairs
                z_dim (int): compression size
                kwargs (dict)
            """

Specify the number of pairs to identify with the parameter n.

How it works

simages trains a convolutional autoencoder with PyTorch and compares the resulting latent representations using the closely package :triangular_ruler:.
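
The comparison step amounts to a closest-pair search over the embedding vectors. As an illustration only, a brute-force version in plain Python is sketched below. It is not the algorithm used by closely or simages, closest_pairs is a hypothetical name, Euclidean distance is an assumption, and the quadratic loop is only practical for small datasets.

```python
import heapq
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def closest_pairs(
    embeddings: Sequence[Sequence[float]], n: int = 5
) -> Tuple[List[Tuple[int, int]], List[float]]:
    """Return the n closest index pairs and their Euclidean distances."""
    # Score every unordered pair (i, j) with i < j by its distance.
    scored = (
        (math.dist(embeddings[i], embeddings[j]), i, j)
        for i, j in combinations(range(len(embeddings)), 2)
    )
    # nsmallest returns the entries sorted by ascending distance.
    best = heapq.nsmallest(n, scored)
    pairs = [(i, j) for _, i, j in best]
    distances = [d for d, _, _ in best]
    return pairs, distances


# Four toy 2-D "embeddings"; rows 0 and 2 are nearly identical.
toy = [[0.0, 0.0], [5.0, 5.0], [0.0, 0.1], [9.0, 1.0]]
pairs, distances = closest_pairs(toy, n=2)
print(pairs)  # [(0, 2), (1, 3)]
```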

Dependencies

simages depends on the following packages:

The following dependencies are required for the interactive deleting interface:

  • pymongodb
  • fastcluster
  • flask
  • jinja2
  • dnspython
  • python-magic
  • termcolor

Cite

If you use simages, please cite it:

    @misc{justin_shenk_2019_3237830,
      author = {Justin Shenk},
      title  = {justinshenk/simages: v19.0.1},
      month  = jun,
      year   = 2019,
      doi    = {10.5281/zenodo.3237830},
      url    = {https://doi.org/10.5281/zenodo.3237830}
    }