LAION, the German research organization that created the data used to train Stable Diffusion, among other generative AI models, has released a new dataset that it claims has been "thoroughly cleaned of known links to suspected child sexual abuse material (CSAM)."
The new dataset, Re-LAION-5B, is effectively a re-release of an old dataset, LAION-5B, but with "fixes" implemented on the recommendations of the nonprofit Internet Watch Foundation, Human Rights Watch, the Canadian Centre for Child Protection and the now-defunct Stanford Internet Observatory. It's available for download in two versions, Re-LAION-5B Research and Re-LAION-5B Research-Safe (which also removes additional NSFW content), both of which were filtered for thousands of links to known (and "likely") CSAM, LAION says.
"LAION has been committed to removing illegal content from its datasets from the very beginning and has implemented appropriate measures to achieve this from the outset," LAION wrote in a blog post. "LAION strictly adheres to the principle that illegal content is removed ASAP after it becomes known."
Important to note is that LAION's datasets don't, and never did, contain images. Rather, they're indexes of links to images and image alt text that LAION curated, all of which came from a different dataset of scraped sites and web pages: the Common Crawl.
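To make that concrete, here is a minimal sketch of the kind of record such an index holds: a pointer to an image hosted elsewhere on the web plus the text scraped alongside it. The field names below are illustrative, not LAION's actual metadata schema.

```python
# Illustrative shape of a single LAION index record. There are no pixels
# in the dataset itself, only a URL to an externally hosted image and the
# alt text / caption that was scraped with it. Field names are hypothetical.
record = {
    "url": "https://example.com/images/cat.jpg",            # link to the image on the open web
    "text": "an orange cat sleeping on a windowsill",        # scraped alt text or caption
    "width": 512,                                             # reported image dimensions
    "height": 384,
}
```

This design is why "removing" content from the dataset means deleting link entries rather than deleting images; the images themselves live on third-party servers.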
The release of Re-LAION-5B comes after an investigation in December 2023 by the Stanford Internet Observatory that found that LAION-5B, specifically a subset called LAION-5B 400M, included at least 1,679 links to illegal images scraped from social media posts and popular adult websites. According to the report, 400M also contained links to "a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes."
While the Stanford co-authors of the report noted that it would be difficult to remove the offending content and that the presence of CSAM doesn't necessarily influence the output of models trained on the dataset, LAION said it would temporarily take LAION-5B offline.
The Stanford report recommended that models trained on LAION-5B "should be deprecated and distribution ceased where feasible." Perhaps relatedly, AI startup Runway recently took down its Stable Diffusion 1.5 model from the AI hosting platform Hugging Face; we've reached out to the company for more information. (Runway in 2023 partnered with Stability AI, the company behind Stable Diffusion, to help train the original Stable Diffusion model.)
Of the new Re-LAION-5B dataset, which contains around 5.5 billion text-image pairs and was released under an Apache 2.0 license, LAION says that the metadata can be used by third parties to clean existing copies of LAION-5B by removing the matching illegal content.
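In practice, that cleaning amounts to pruning any link entry in an old copy that no longer appears in the cleaned metadata. Here is a minimal sketch of one way a third party might do that, assuming both datasets' metadata are available locally as Parquet shards with a URL column; the file names and column names are assumptions, not the released datasets' actual layout.

```python
# Sketch: prune a local LAION-5B metadata shard against Re-LAION-5B.
# Assumes both shards are Parquet files sharing a "url" column; the real
# column names, file names, and sharding scheme may differ.
import pandas as pd

old = pd.read_parquet("laion5b_part_00000.parquet")       # hypothetical local LAION-5B shard
clean = pd.read_parquet("relaion5b_part_00000.parquet")   # corresponding Re-LAION-5B shard

# Keep only rows whose link survived LAION's filtering, i.e. drop any
# URL that is absent from the cleaned release.
allowed = set(clean["url"])
pruned = old[old["url"].isin(allowed)]

pruned.to_parquet("laion5b_part_00000_cleaned.parquet")
```

Matching on URLs (or, as LAION describes below, on link and image hashes) sidesteps ever downloading or inspecting the flagged material itself.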
LAION stresses that its datasets are intended for research, not commercial, purposes. But, if history is any indication, that won't dissuade some organizations. Beyond Stability AI, Google once used LAION datasets to train its image-generating models.
"In all, 2,236 links [to suspected CSAM] were removed after matching with the lists of link and image hashes provided by our partners," LAION continued in the post. "These links also subsume 1008 links found by the Stanford Internet Observatory report in December 2023 … We strongly urge all research labs and organizations who still employ old LAION-5B to migrate to Re-LAION-5B datasets as soon as possible."