Tobacco800

Tobacco800 is a public subset of the Complex Document Image Processing (CDIP) test collection constructed by the Illinois Institute of Technology. The CDIP collection was assembled from 42 million pages of documents (in 7 million multi-page TIFF images) released by tobacco companies under the Master Settlement Agreement and originally hosted at UCSF.

Tobacco800 consists of 1290 document images. It is a realistic database for document image analysis research because the documents were collected and scanned with a wide variety of equipment over time. In addition, a significant percentage of the images in Tobacco800 belong to consecutively numbered multi-page business documents, which makes the dataset a valuable testbed for content-based document image retrieval approaches. Document resolutions range from 150 to 300 DPI, and image dimensions range from 1200 by 1600 to 2500 by 3200 pixels.
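To give a sense of what these resolution and dimension ranges mean physically, the following minimal sketch converts pixel dimensions to page size in inches. The helper name `page_size_inches` is hypothetical, and pairing the smallest dimensions with 150 DPI and the largest with 300 DPI is an assumption made only for illustration; the dataset description does not state which resolution goes with which image size.

```python
def page_size_inches(width_px, height_px, dpi):
    """Convert pixel dimensions to physical size in inches at a given DPI.

    Hypothetical helper for illustration; not part of any Tobacco800 tooling.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return width_px / dpi, height_px / dpi


# Assumed pairings of the extremes quoted in the description above.
smallest = page_size_inches(1200, 1600, 150)
largest = page_size_inches(2500, 3200, 300)

print("smallest image at 150 DPI: %.2f x %.2f in" % smallest)
print("largest image at 300 DPI:  %.2f x %.2f in" % largest)
```

Under these assumed pairings, both extremes come out close to a letter-sized page, which would be consistent with scanned business documents.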

License


  • University of Maryland, Laboratory for Language and Media Processing (LAMP). Tobacco-800 signatures and logos dataset. http://lamp.cfar.umd.edu, 2006.
