GitHub’s commercial AI tools are built from open source code


“I’m usually happy to see free-to-use extensions, but when they ultimately benefit large companies, I’m a bit distressed that these companies are extracting value from the work of small authors,” Woods said.

One thing about neural networks is clear: they can memorize their training data and reproduce it. Colin Raffel, a professor of computer science at the University of North Carolina, explained that this risk exists whether the data involves personal information, medical records, or copyrighted code. He co-authored a preprint (not yet peer-reviewed) examining similar copying in OpenAI’s GPT-2. The team found that getting a model trained on a large body of text to spit out its training data is fairly trivial, but that it is difficult to predict what a model will memorize and copy. “You only really find out when you throw it out into the world and people use and abuse it,” Raffel said. Given this, he was surprised to see that GitHub and OpenAI chose to train their model on copyrighted code.
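The kind of probe Raffel describes is easy to picture in code. The sketch below is a minimal, hypothetical version assuming the Hugging Face transformers library and the public GPT-2 checkpoint: it samples freely from the model, then flags any output that reproduces a long verbatim stretch of a reference corpus. The corpus index here is a stand-in; a real study would need a scalable index over the training data.

```python
# Minimal sketch of a training-data extraction probe: sample from a
# language model, then flag outputs that reproduce the corpus verbatim.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def sample_completions(prompt: str, n: int = 5, max_new_tokens: int = 64):
    """Draw n random samples from the model for a short prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,              # random sampling surfaces memorized text
        top_k=40,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

def appears_verbatim(text: str, corpus_windows: set, window: int = 50) -> bool:
    """True if any 50-character window of the sample occurs in the corpus.

    corpus_windows is a hypothetical pre-built set of character windows
    drawn from the training data, standing in for a real index.
    """
    return any(text[i:i + window] in corpus_windows
               for i in range(max(1, len(text) - window)))
```

The hard part, as Raffel’s point suggests, is not writing the probe but knowing in advance which prompts will surface memorized text at all.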

According to GitHub’s internal testing, direct copying occurs in roughly 0.1 percent of Copilot’s output, a glitch the company characterizes as surmountable rather than an inherent flaw in the AI model. That figure is enough to cause consternation in the legal department of any for-profit entity (“non-zero risk” is simply “risk” to a lawyer), but Raffel points out that it may be no different from an employee copying and pasting restricted code. Automation or not, humans break the rules. Open source developer Armin Ronacher added that most of Copilot’s copying appears to be relatively harmless: simple solutions to common problems that recur again and again, or oddities like the notorious Quake code, which people have (incorrectly, license-wise) copied into many different codebases. “You can make Copilot trigger funny things,” he said. “If it’s used as intended, I don’t think it will be much of a problem.”
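GitHub has not published its methodology, but the shape of such a measurement is straightforward to sketch. In the hypothetical snippet below, generate_snippet and training_index are stand-ins for Copilot’s model and an index over its training corpus; the length threshold matters, because short idioms and boilerplate match constantly without amounting to meaningful copying.

```python
# Back-of-the-envelope version of a copy-rate measurement: sample many
# completions and report the fraction containing a long verbatim match.
# generate_snippet and training_index are hypothetical stand-ins.

def verbatim_copy_rate(generate_snippet, training_index,
                       trials: int = 10_000, min_len: int = 150) -> float:
    """Fraction of sampled outputs with a verbatim match >= min_len chars."""
    copies = 0
    for _ in range(trials):
        snippet = generate_snippet()
        if training_index.longest_match(snippet) >= min_len:
            copies += 1
    return copies / trials

# With a figure like GitHub's, this would come out near 0.001 (0.1%).
```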

GitHub also says it has a possible solution in the works: a way to flag these verbatim outputs as they occur, so that programmers and their lawyers know not to reuse them commercially. But building such a system is not as simple as it sounds, and Raffel notes that it sidesteps a bigger problem: what if the output is not verbatim, but a near copy of the training data? What if only the variable names have changed, or a line is expressed slightly differently? In other words, how much change is required before the system stops treating the output as a copy? With code-generating software in its infancy, the legal and ethical boundaries remain unclear.
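To see why that line is hard to draw, consider a toy fingerprint built on Python’s standard tokenize module. It erases identifier names before comparing two snippets, so a copy whose variables were merely renamed is still caught, while a genuinely restructured rewrite slips past. This is an illustration of the problem, not any filter GitHub has described.

```python
# Toy near-duplicate check: compare token streams with identifiers erased.
import io
import tokenize

def fingerprint(source: str) -> tuple:
    """Token stream of `source` with every identifier replaced by 'ID'."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            tokens.append("ID")            # erase variable/function names
        elif tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                          tokenize.DEDENT, tokenize.ENDMARKER):
            continue                       # ignore layout tokens
        else:
            tokens.append(tok.string)
    return tuple(tokens)

a = "total = sum(values) / len(values)"
b = "avg = sum(xs) / len(xs)"              # only the names changed
c = "avg = 0\nfor x in xs: avg += x\navg /= len(xs)"  # restructured

print(fingerprint(a) == fingerprint(b))    # True: renamed copy is caught
print(fingerprint(a) == fingerprint(c))    # False: rewrite escapes the check
```

Each added layer of normalization catches more rewrites but also more coincidences, which is exactly the boundary question Raffel raises.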

Andy Sellars, director of the Technology Law Clinic at Boston University, explained that many legal scholars believe AI developers have considerable latitude in choosing training data. “Fair use” of copyrighted material largely comes down to whether it is “transformed” when reused. There are many ways to transform a work, such as using it for parody, criticism, or summary, or, as courts have repeatedly found, using it as fuel for algorithms. In one prominent case, a federal court dismissed a lawsuit that a publishing group brought against Google Books, holding that its process of scanning books and using snippets of text to let users search them was an example of fair use. But how this translates to AI training data is not yet settled, Sellars added.

He pointed out that it is a bit strange that code falls under the same system as books and artwork. “We treat source code as a literary work, even though it bears little resemblance to literature,” he said. We may think of code as chiefly practical, where the task it accomplishes matters more than how it is written, but copyright law turns on how ideas are expressed. “If Copilot spits out an output that does the same thing as one of its training inputs, with similar parameters and similar results, but different code, that probably won’t implicate copyright law,” he said.

The ethics of the situation are another matter. “There is no guarantee that GitHub will keep the interests of independent coders at heart,” Sellars said. Copilot depends on the work of its users, he noted, including those who have explicitly tried to prevent their work from being reused for profit, and it may also reduce demand for those same coders by automating more of the work of programming. “We should never forget that there is no cognition in the model,” he said. It is statistical pattern matching; any insight or creativity mined from the data is entirely human. Some scholars argue that Copilot underscores the need for new mechanisms to ensure that the people who produce the data for AI are fairly compensated.

