Some will regard this as an enhancement request. To each his own, but IMO *grep has always had a huge deficiency when processing natural language, because matches cannot span line breaks. pdfgrep suffers especially, because most PDF docs carry a payload of natural language.

If I need to search for “the.orange.menace” (dots are 1-char wildcards), of course I want to be told of cases like this:

A court whereby no one is above the law found the orange  
menace guilty on 34 counts of fraud.
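
To make the problem concrete, here is a minimal Python sketch (my own illustration, not something grep or pdfgrep does). It collapses every run of whitespace, line breaks included, into a single space before searching. The function name `search_ignoring_line_breaks` is hypothetical.

```python
import re


def search_ignoring_line_breaks(pattern, text):
    """Return every match of `pattern` after collapsing whitespace runs.

    Assumption: a line break inside a sentence carries no meaning, so any
    run of spaces, tabs, and newlines is treated as one space.
    """
    normalized = re.sub(r"\s+", " ", text)
    return [m.group(0) for m in re.finditer(pattern, normalized)]


text = (
    "A court whereby no one is above the law found the orange  \n"
    "menace guilty on 34 counts of fraud."
)

# Searching the raw text fails, because three characters (two spaces and
# a newline) sit between "orange" and "menace".
print(re.search(r"the.orange.menace", text))  # None

# Searching the normalized text finds the phrase.
print(search_ignoring_line_breaks(r"the.orange.menace", text))  # ['the orange menace']
```

The cost of this approach is that line numbers are lost, so a real tool would need some other way to report where the match sits.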

When processing a natural language, a sentence terminator is almost always a more sensible boundary. There’s probably no command older than grep that’s still in use today, so it’s bizarre that it has not evolved much. In the 90s there was a LexisNexis search tool which was far superior for natural language queries. E.g. (IIRC):

  • foo w/s bar :: matches if “foo” appears within the same sentence as “bar”
  • foo w/4 bar :: matches if “foo” appears within four words of “bar”
  • foo pre/5 bar :: matches if “foo” appears before “bar”, within five words
  • foo w/p bar :: matches if “foo” appears within the same paragraph as “bar”
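
The four operators above can be roughly sketched in Python. This is an approximation from my recollection of the semantics, not the actual LexisNexis implementation. All function names are hypothetical, sentences are split naively on `.`, `!`, and `?`, paragraphs on blank lines, and "within N words" is assumed to mean the word positions differ by at most N.

```python
import re


def words(text):
    """Lowercased word tokens; line breaks count as ordinary whitespace."""
    return re.findall(r"\w+", text.lower())


def same_unit(a, b, units):
    """True if both words occur together in at least one unit."""
    a, b = a.lower(), b.lower()
    return any(a in words(u) and b in words(u) for u in units)


def within_sentence(a, b, text):
    """foo w/s bar (naive sentence splitting on . ! ?)."""
    return same_unit(a, b, re.split(r"(?<=[.!?])\s+", text))


def within_paragraph(a, b, text):
    """foo w/p bar (paragraphs assumed to be separated by blank lines)."""
    return same_unit(a, b, re.split(r"\n\s*\n", text))


def within_words(a, b, n, text, ordered=False):
    """foo w/n bar, or foo pre/n bar when ordered=True."""
    tokens = words(text)
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    for i in pos_a:
        for j in pos_b:
            distance = j - i
            if ordered and distance <= 0:
                continue  # pre/n requires `a` to come before `b`
            if 0 < abs(distance) <= n:
                return True
    return False


sample = (
    "The court found the orange\n"
    "menace guilty. Sentencing comes later.\n"
    "\n"
    "The appeal is pending."
)

print(within_sentence("orange", "guilty", sample))
print(within_paragraph("orange", "later", sample))
print(within_words("orange", "guilty", 4, sample))
print(within_words("guilty", "orange", 4, sample, ordered=True))
```

Note that none of these checks care where the line breaks fall, which is the whole point.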

Newlines as record separators are probably sensible for all things other than natural language. But for natural language, grep is a hack.
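
For comparison, a sentence-as-record search could look something like the following Python sketch. It is only an illustration under a naive assumption (a sentence ends at `.`, `!`, or `?` followed by whitespace), and `sentence_grep` is a hypothetical name, not an existing tool.

```python
import re


def sentence_grep(pattern, text):
    """Yield each sentence that contains a match for `pattern`.

    Each sentence is treated as one record, and whitespace inside it
    (including line breaks) is normalized to single spaces.
    """
    regex = re.compile(pattern)
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        record = " ".join(sentence.split())
        if regex.search(record):
            yield record


doc = "A court found the orange\nmenace guilty. The appeal\nis pending."

for hit in sentence_grep(r"orange menace", doc):
    print(hit)
```

Abbreviations such as "e.g." would trip up this splitter, so a serious version would need a better sentence boundary detector.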

  • TootSweet@lemmy.world
    1 month ago

    But now that the defect has been rooted in…

    Not a defect. What is it with people equating “doesn’t do this one harebrained thing I want it to” with “broken”?

    It’s not a bug if it works as designed. Unless somewhere some official documentation says (some specific version of) grep supports what you’re advocating for but the actual grep command doesn’t, it’s not a defect. It’s a feature request.

    To qualify as a “bug”, I’d also accept “it used to do this and it doesn’t any more and not on purpose”.

    Even if (say, GNU) grep maintainers decided they’d make grep support what you’re going for, there’d still be design to do. Should it be a flag? Should the regex syntax be extended to support this? Should we add an environment variable? Some combination of the three? Something else? If we go with the flag, what should it be called and what should be its semantic meaning? Should it take an argument? Etc, etc, etc.

    Even assuming this feature is necessary to fulfill “grep’s intended purpose” (and I’m far from convinced it is), that doesn’t make it a bug if it was never designed into the program.

    • freedomPusher@sopuli.xyz (OP)
      1 month ago

      It’s not a bug if it works as designed.

      What you claim here is that software cannot have a defective design. Of course design defects exist, and they are the hardest to correct.

      I’d also accept “it used to do this and it doesn’t any more and not on purpose”.

      This is conventional wisdom. Past behavior is no more an indication of correctness than of defectiveness. grep’s purpose was to process natural language, and a line feed is not a sensible terminator in that application. For 50 years people have just lived with the limitation, worked around it, or adapted to single-token searches. It does not cease to be a defect because workarounds were available.

      that doesn’t make it a bug if it was never designed into the program.

      The original design was implemented on a system that was extremely resource-poor by today’s standards, where 64k was a HUGE amount of space. It was built to function under limitations that no longer exist. I would say the design is not defective so long as your target platform is a PDP-11 from the 1970s. Otherwise the design should evolve along with the tasks and machines.