Paper Bidding With pdfgrep

Sun, Jul 12, 2015

I have a bunch of paper bidding to do for POPL, and in discussing with others I got a great tip from Eran Yahav on using pdfgrep to find papers of interest. I thought I’d quickly write up how I’m using it. First, I downloaded all the submissions. If your conference is using HotCRP, you can conveniently do a search for -conflict:me to get a list of all papers for which you do not have a conflict, and then download them using the links at the bottom. Once you have all the papers in a folder, you can run a search like so:

# find papers mentioning types
pdfgrep -cH types *.pdf | grep -v ":0"

I run with -c to only show which files match and the number of matches—usually I want to open the full PDF to get proper context for the actual matches. The above search can be a bit slow, but fortunately most of us have multicores, and the search is easily parallelized using GNU Parallel:

parallel 'pdfgrep -cH types {}' ::: *.pdf | grep -v ":0"

On my MacBook Pro, this searches around 250 PDF files in less than 20 seconds. I installed pdfgrep and GNU Parallel using MacPorts, but I’m guessing Homebrew or direct installation will also work fine.

UPDATE: I just realized that on Mac, once you have the PDFs downloaded to a folder, you can just search within the folder using Finder, and it will be a lot faster than pdfgrep. But the above can still be useful if you want to do regexp search or otherwise integrate the search results with a command-line workflow.

Transitive Closure

Paper Bidding With pdfgrep