grep/pdfgrep’s inability to match across lines

freedomPusher@sopuli.xyz · edit-2 1 month ago

mozz@mbin.grits.dev · 1 month ago

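grep isn’t really designed as a natural language search tool, but perl -pe can do something pretty close to what you’re looking for. The first command below joins the file into one long line; the second prints each match with up to 20 characters of context on either side.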

perl -0777 -pe 's/\n/ /g' file.txt | perl -ne 'print "$1\n" while /(.{0,20}(the.orange.menace).{0,20})/g'

freedomPusher@sopuli.xyz · 1 month ago

grep isn’t really designed as a natural language search tool

My understanding of grep’s history is that Ken Thompson created it to do some textual analysis on The Federalist Papers, which to me sounds like it was designed for processing natural language. But it ran on a PDP-11, which had tight resource constraints. Lines of text would have been simpler to handle than sentences, given the limited resources of the 1970s.

Thanks for the Perl code, though I might favor sed or awk for that job. Of course, that also means complicating Emacs’ grep mode facility. And for PDFs, I guess I’d accept pdfgrep’s limitations rather than run a text extraction on every PDF.
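
For what it’s worth, the awk route might look something like the rough sketch below. It rests on a few assumptions: the function name ctxgrep is made up, the pattern is an awk extended regex, the whole file fits in memory, and the 20 characters of context simply mirror the Perl one-liner above.

```shell
# ctxgrep PATTERN FILE  (hypothetical helper name, not a standard tool)
# Joins all lines of FILE with single spaces, then prints every match of
# PATTERN (an awk extended regex) with up to 20 characters of context
# on each side.
ctxgrep() {
    awk -v pat="$1" '
        # Build one long string, with a space wherever a line break was.
        { text = (NR == 1 ? $0 : text " " $0) }
        END {
            pos = 1
            while (pos <= length(text) && match(substr(text, pos), pat)) {
                start = pos + RSTART - 1               # match start within text
                from  = (start > 20 ? start - 20 : 1)  # clamp left context
                print substr(text, from, (start - from) + RLENGTH + 20)
                # Step past the match; advance at least 1 char on empty matches.
                pos = start + (RLENGTH > 0 ? RLENGTH : 1)
            }
        }
    ' "$2"
}
```

Something like ctxgrep 'the.orange.menace' file.txt would then be the awk counterpart of the Perl pipeline. Because it prints plain text rather than file:line locations, it would still need extra work to plug into Emacs’ grep mode.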

mozz@mbin.grits.dev · 1 month ago

Hm… yeah, I didn’t know that; I just sort of assumed it was initially for searching code and the like, but you are correct.

BTW, I just learned about pcregrep -M, which does what you’re asking for a little more directly: pcregrep -M 'the(.|\n)orange(.|\n)menace' file.txt seems to work, although you may want -A or -B for more useful output.