Some will regard this as an enhancement request. To each his own, but IMO the *grep tools have always had a huge deficiency when processing natural language, because they treat every line break as a record boundary. That hurts pdfgrep especially, because most PDF docs carry a payload of natural language.

If I need to search for “the.orange.menace” (dots are 1-char wildcards), of course I want to be told of cases like this:

A court whereby no one is above the law found the orange  
menace guilty on 34 counts of fraud.

When processing natural language, a sentence terminator is almost always a more sensible boundary. grep is among the oldest commands still in use today, so it’s bizarre that it has not evolved much. In the 90s there was a LexisNexis search tool which was far superior for natural language queries. E.g. (IIRC):

  • foo w/s bar :: matches if “foo” appears within the same sentence as “bar”
  • foo w/4 bar :: matches if “foo” appears within four words of “bar”
  • foo pre/5 bar :: matches if “foo” appears before “bar”, within five words
  • foo w/p bar :: matches if “foo” appears within the same paragraph as “bar”
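To make the connectors above concrete, here is a rough Python sketch of how they could behave. It is not how LexisNexis implemented them: the function names, the tokenizer, the naive sentence and paragraph rules, and the reading of “within n words” as a difference of at most n word positions are all my own assumptions.

```python
import re

# Assumed tokenizer: runs of word characters, allowing inner apostrophes.
WORD = re.compile(r"\w+(?:['’]\w+)*")
# Naive boundaries (assumptions): a sentence ends at . ! or ? followed by
# whitespace; a paragraph ends at a blank line.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
PARAGRAPH_END = re.compile(r"\n\s*\n")


def words(text):
    """Lower-cased word tokens, in document order."""
    return [w.lower() for w in WORD.findall(text)]


def within_words(text, a, b, n, ordered=False):
    """Rough 'a w/n b'; with ordered=True, rough 'a pre/n b'."""
    toks = words(text)
    pos_a = [i for i, w in enumerate(toks) if w == a.lower()]
    pos_b = [i for i, w in enumerate(toks) if w == b.lower()]
    for i in pos_a:
        for j in pos_b:
            gap = j - i  # positive when a comes before b
            if ordered and 0 < gap <= n:
                return True
            if not ordered and 0 < abs(gap) <= n:
                return True
    return False


def same_unit(text, a, b, splitter):
    """True if a and b both occur in one chunk produced by splitter."""
    for unit in splitter.split(text):
        toks = set(words(unit))
        if a.lower() in toks and b.lower() in toks:
            return True
    return False


def same_sentence(text, a, b):
    """Rough 'a w/s b'."""
    return same_unit(text, a, b, SENTENCE_END)


def same_paragraph(text, a, b):
    """Rough 'a w/p b'."""
    return same_unit(text, a, b, PARAGRAPH_END)
```

Because the tokenizer ignores line breaks, `same_sentence(text, "orange", "menace")` is true even when the two words sit on different lines, which is exactly what line-based grep cannot tell you.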

Newlines as record separators are probably sensible for everything other than natural language. But for natural language, grep is a hack.
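For illustration, a minimal sketch of what “sentence as the record” could look like, in Python. The name `sentence_grep` is hypothetical, and the terminator rule (., ! or ? followed by whitespace) is a naive assumption that will split wrongly on abbreviations like “e.g.”; it is only meant to show the idea.

```python
import re


def sentence_grep(pattern, text):
    """Return the sentences of text that match pattern, one per record."""
    regex = re.compile(pattern)
    hits = []
    # Naive terminator rule: ., ! or ? followed by whitespace ends a record.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Fold line breaks and runs of spaces into single spaces.
        record = " ".join(sentence.split())
        if record and regex.search(record):
            hits.append(record)
    return hits
```

Called as `sentence_grep(r"the.orange.menace", text)` on the two-line example above, this reports the whole sentence as one hit, because the line break has been folded into a single space before matching.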

  • mozz@mbin.grits.dev
    1 month ago

    grep isn’t really designed as a natural language search tool but perl -pe can do a pretty similar thing to what you’re looking for.

    perl -0777 -pe 's/\s*\n\s*/ /g' file.txt | perl -ne 'print "$1\n" while /(.{0,20}the.orange.menace.{0,20})/g'
    
    • freedomPusher@sopuli.xyzOPM
      1 month ago

      “grep isn’t really designed as a natural language search tool”

      My understanding of grep history is that Ken Thompson created grep to do some textual analysis on The Federalist Papers, which to me sounds like it was designed for processing natural language. But it was on a PDP-11, which had resource constraints. Lines of text would have been more uniform to manage than sentences, given the limited resources of the 1970s.

      Thanks for the Perl code, though I might favor sed or awk for that job. Of course that also means complicating Emacs’ grep mode facility. And for PDFs I guess I’d opt for pdfgrep’s limitations over doing a text extraction on every PDF.

      • mozz@mbin.grits.dev
        1 month ago

        Hm… yeah, I didn’t know that; I just sort of assumed that it was initially for searching code etc., but you are correct.

        BTW I just learned about pcregrep -M, which does what you’re asking for a little more directly: pcregrep -M 'the(.|\n)orange(.|\n)menace' seems to work, although you may also want -A or -B for more useful output.
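The same multi-line idea can be sketched in Python, in case pcregrep isn’t installed. This is only an illustration, not a pcregrep replacement: `multiline_grep` is a made-up name, and the sketch assumes the whole text fits in memory. `re.DOTALL` lets “.” match a newline, which plays the role of `(.|\n)` in the pattern above.

```python
import re


def multiline_grep(pattern, text):
    """Return (line_number, matched_text) pairs; matches may span line breaks."""
    regex = re.compile(pattern, re.DOTALL)  # DOTALL: "." also matches "\n"
    results = []
    for m in regex.finditer(text):
        # 1-based number of the line on which the match starts.
        line_no = text.count("\n", 0, m.start()) + 1
        results.append((line_no, m.group(0)))
    return results
```

One caveat: a single “.” only bridges one character, so a line ending in trailing spaces before the break will not match `the.orange.menace`. A pattern like `the\s+orange\s+menace` is more forgiving there.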