• sbv@sh.itjust.works
    1 year ago

    However, because he knew that the dataset was “being fed by essentially unguided crawling” of the web, including “a significant amount of explicit material,” he also didn’t rule out the possibility that image generators could also be directly referencing CSAM included in the LAION-5B dataset.

    I wonder what the minimum dataset size would be to produce useful LLMs. Dealing with a massive uncurated dataset seems like a bad idea if you’re concerned about harmful outputs.