Deleting unethical data sets is not good enough

The researchers’ analysis also shows that “Labeled Faces in the Wild” (LFW), a data set launched in 2007 and the first to use facial images scraped from the Internet, has morphed repeatedly over nearly 15 years of use. Although it began as a resource for evaluating facial recognition models intended only for research, it is now used almost exclusively to evaluate systems meant for the real world. This is despite a warning label on the data set’s website that cautions against such use.

Recently, this data set was repurposed in a derivative called SMFRD, which added a face mask to each image to advance facial recognition during the pandemic. The authors note that this could raise new ethical challenges. Privacy advocates, for example, have criticized such applications for fostering surveillance, especially by enabling governments to identify masked protesters.

“This is a very important paper because people usually don’t notice the complexity, potential harms and risks of data sets,” said Margaret Mitchell, an artificial intelligence ethics researcher and a leader in responsible data practices, who did not participate in the research.

She added that for a long time, the culture of the artificial intelligence community has assumed that data exists to be used. This paper shows how that assumption can cause problems. “It is very important to carefully consider the various values that a data set encodes, as well as the values that making a data set available encodes,” she said.


The study authors offer several recommendations for the AI community going forward. First, creators should communicate the intended use of their data sets more clearly, through both licenses and detailed documentation. They should also place stricter limits on access to their data, perhaps by requiring researchers to sign terms of agreement or fill out an application, especially if they intend to construct a derivative data set.

Second, research conferences should establish norms for how data is collected, labeled, and used, and should create incentives for responsible data set creation. NeurIPS, the largest artificial intelligence research conference, already includes a checklist of best practices and ethical guidelines.

Mitchell suggested going further. As part of the BigScience project, a collaboration among AI researchers to develop an AI model that can parse and generate natural language under strict ethical standards, she has been experimenting with the idea of data set stewardship organizations. These would be teams of people who not only handle the curation, maintenance and use of the data, but also work with lawyers, activists and the public to ensure that it meets legal standards, is collected only with consent, and can be deleted if someone chooses to withdraw personal information. Not all data sets would require this kind of stewardship organization, but it is certainly necessary for scraped data that may contain biometric or personally identifiable information or intellectual property.

“Data set collection and monitoring is not a one-time task for one or two people,” she said. “If you do this responsibly, it will be broken down into a large number of different tasks that require deep thinking, deep expertise, and a variety of different people.”

In recent years, the field has increasingly come to believe that more carefully curated data sets will be key to overcoming many of the industry’s technical and ethical challenges. It is now clear that building more responsible data sets is not enough. People who work in artificial intelligence must also commit to maintaining them over the long term and using them ethically.
