We propose a new dataset of images and associated texts to the Document Analysis community. We collected receipts to build a corpus of genuine and anonymous documents that serves as a benchmark for evaluating fraud detection approaches.

This dataset currently comprises 1969 receipt images, each with its associated OCR result. Of these, 250 have been altered. A competition held in conjunction with ICPR 2018 is ongoing; its tasks are to detect the altered receipts among the others and to localize the alterations within the falsified receipts. The dataset will be available after the end of the competition.

You can contribute to building a better dataset by correcting the OCR results. This would give the DA community a high-quality ground truth and make it possible to work with text-based methods.

Your challenge: reach the top of the ranking of the most prolific correctors!

We will provide the manually corrected text corpus once it is finished. The next step will be to falsify some receipts to produce the final corpus.

This dataset was presented at the First International Workshop on Computational Document Forensics (in conjunction with ICDAR 2017).

Chloé Artaud, Antoine Doucet, Jean-Marc Ogier, Vincent Poulain d’Andecy