Thursday, May 15, 2025
No Result
View All Result
newshub
  • Global news
  • Financial insights
    • Africa
    • Asia
    • Australia
    • Central Banks
    • China
    • Commodities
    • Corporate
    • Europe
    • Fintech
      • AI
      • Banking
      • Blockchain
      • Crypto
      • MSTRpay
      • Neobanking
    • Investment
    • Japan
    • South East Asia
    • Stock of the week
    • UK
    • US
  • Climate & energy
    • Climate
    • Carbon
    • Coal
    • Disruptive
    • Gas
    • Nuclear
    • Oil
    • Solar
    • Water
    • Waves
    • Wind
    • Renewable
    • South America
  • Lifestyle
    • Best chefs
    • Cocktail of the week
    • History
    • Influential women
  • WEX
    • Alt Kap Holding AB
    • Digital Network Holding, Inc.
    • Fantas-E AB
    • International Clean Energy Inc.
    • Intritum Partner Limited
    • Intritum Recycling GH Limited
    • MSTRpay AB
    • SWAP Services, Inc.
    • VMT Holding, Inc.
    • Universal Streaming Technologies – USTA
    • TC Unterhaltungselektronik AG
  • Global news
  • Financial insights
    • Africa
    • Asia
    • Australia
    • Central Banks
    • China
    • Commodities
    • Corporate
    • Europe
    • Fintech
      • AI
      • Banking
      • Blockchain
      • Crypto
      • MSTRpay
      • Neobanking
    • Investment
    • Japan
    • South East Asia
    • Stock of the week
    • UK
    • US
  • Climate & energy
    • Climate
    • Carbon
    • Coal
    • Disruptive
    • Gas
    • Nuclear
    • Oil
    • Solar
    • Water
    • Waves
    • Wind
    • Renewable
    • South America
  • Lifestyle
    • Best chefs
    • Cocktail of the week
    • History
    • Influential women
  • WEX
    • Alt Kap Holding AB
    • Digital Network Holding, Inc.
    • Fantas-E AB
    • International Clean Energy Inc.
    • Intritum Partner Limited
    • Intritum Recycling GH Limited
    • MSTRpay AB
    • SWAP Services, Inc.
    • VMT Holding, Inc.
    • Universal Streaming Technologies – USTA
    • TC Unterhaltungselektronik AG
No Result
View All Result
newshub
No Result
View All Result
ADVERTISEMENT

Transparency is often lacking in datasets used to train large language models

2024/09/01/09:31
in AI
Reading Time: 5 mins read
259 10
A A
Transparency is often lacking in datasets used to train large language models
MSTRpay MSTRpay MSTRpay
ADVERTISEMENT

Researchers developed an easy-to-use tool that enables an AI practitioner to find data that suits the purpose of their model, which could improve accuracy and reduce bias.

In order to train more powerful large language models, researchers use vast dataset collections that blend diverse data from thousands of web sources.

But as these datasets are combined and recombined into multiple collections, important information about their origins and restrictions on how they can be used are often lost or confounded in the shuffle.

Not only does this raise legal and ethical concerns, it can also damage a model’s performance. For instance, if a dataset is miscategorized, someone training a machine-learning model for a certain task may end up unwittingly using data that are not designed for that task.

In addition, data from unknown sources could contain biases that cause a model to make unfair predictions when deployed.

To improve data transparency, a team of multidisciplinary researchers from MIT and elsewhere launched a systematic audit of more than 1,800 text datasets on popular hosting sites. They found that more than 70 per cent of these datasets omitted some licensing information, while about 50 per cent had information that contained errors.

Building off these insights, they developed a user-friendly tool called the Data Provenance Explorer that automatically generates easy-to-read summaries of a dataset’s creators, sources, licenses, and allowable uses.

“These types of tools can help regulators and practitioners make informed decisions about AI deployment, and further the responsible development of AI,” says Alex “Sandy” Pentland, an MIT professor, leader of the Human Dynamics Group in the MIT Media Lab, and co-author of a new open-access paper about the project.

The Data Provenance Explorer could help AI practitioners build more effective models by enabling them to select training datasets that fit their model’s intended purpose. In the long run, this could improve the accuracy of AI models in real-world situations, such as those used to evaluate loan applications or respond to customer queries.

“One of the best ways to understand the capabilities and limitations of an AI model is understanding what data it was trained on. When you have misattribution and confusion about where data came from, you have a serious transparency issue,” says Robert Mahari, a graduate student in the MIT Human Dynamics Group, a JD candidate at Harvard Law School, and co-lead author on the paper.

ADVERTISEMENT

Mahari and Pentland are joined on the paper by co-lead author Shayne Longpre, a graduate student in the Media Lab; Sara Hooker, who leads the research lab Cohere for AI; as well as others at MIT, the University of California at Irvine, the University of Lille in France, the University of Colorado at Boulder, Olin College, Carnegie Mellon University, Contextual AI, ML Commons, and Tidelift. The research is published today in Nature Machine Intelligence.

Focus on fine-tuning
Researchers often use a technique called fine-tuning to improve the capabilities of a large language model that will be deployed for a specific task, like question-answering. For finetuning, they carefully build curated datasets designed to boost a model’s performance for this one task.

The MIT researchers focused on these fine-tuning datasets, which are often developed by researchers, academic organizations, or companies and licensed for specific uses.

When crowdsourced platforms aggregate such datasets into larger collections for practitioners to use for fine-tuning, some of that original license information is often left behind.

“These licenses ought to matter, and they should be enforceable,” Mahari says.

For instance, if the licensing terms of a dataset are wrong or missing, someone could spend a great deal of money and time developing a model they might be forced to take down later because some training data contained private information.

“People can end up training models where they don’t even understand the capabilities, concerns, or risk of those models, which ultimately stem from the data,” Longpre adds.

To begin this study, the researchers formally defined data provenance as the combination of a dataset’s sourcing, creating, and licensing heritage, as well as its characteristics. From there, they developed a structured auditing procedure to trace the data provenance of more than 1,800 text dataset collections from popular online repositories.

After finding that more than 70 per cent of these datasets contained “unspecified” licenses that omitted much information, the researchers worked backwards to fill in the blanks. Through their efforts, they reduced the number of datasets with “unspecified” licenses to around 30 per cent.

Their work also revealed that the correct licenses were often more restrictive than those assigned by the repositories.   

In addition, they found that nearly all dataset creators were concentrated in the global north, which could limit a model’s capabilities if it is trained for deployment in a different region. For instance, a Turkish language dataset created predominantly by people in the U.S. and China might not contain any culturally significant aspects, Mahari explains.

“We almost delude ourselves into thinking the datasets are more diverse than they actually are,” he says.

Interestingly, the researchers also saw a dramatic spike in restrictions placed on datasets created in 2023 and 2024, which might be driven by concerns from academics that their datasets could be used for unintended commercial purposes.

A user-friendly tool
To help others obtain this information without the need for a manual audit, the researchers built the Data Provenance Explorer. In addition to sorting and filtering datasets based on certain criteria, the tool allows users to download a data provenance card that provides a succinct, structured overview of dataset characteristics.

“We are hoping this is a step, not just to understand the landscape, but also help people going forward to make more informed choices about what data they are training on,” Mahari says.

In the future, the researchers want to expand their analysis to investigate data provenance for multimodal data, including video and speech. They also want to study how terms of service on websites that serve as data sources are echoed in datasets.

WE/X WE/X WE/X
ADVERTISEMENT

As they expand their research, they are also reaching out to regulators to discuss their findings and the unique copyright implications of fine-tuning data.

“We need data provenance and transparency from the outset, when people are creating and releasing these datasets, to make it easier for others to derive these insights,” Longpre says.

“Many proposed policy interventions assume that we can correctly assign and identify licenses associated with data, and this work first shows that this is not the case, and then significantly improves the provenance information available,” says Stella Biderman, executive director of EleutherAI, who was not involved with this work. “In addition, section 3 contains relevant legal discussion. This is very valuable to machine learning practitioners outside companies large enough to have dedicated legal teams. Many people who want to build AI systems for public good are currently quietly struggling to figure out how to handle data licensing, because the internet is not designed in a way that makes data provenance easy to figure out.”

Source: MIT

Related Posts

US tech firms strike AI deals as Trump tours Gulf states
AI

US tech firms strike AI deals as Trump tours Gulf states

by newshub
23 hours ago

US technology companies have signed major artificial intelligence agreements with Gulf nations during former President Donald Trump’s high-profile visit to...

Read moreDetails
NVIDIA Dynamo: Scaling AI inference with open-source efficiency

NVIDIA Dynamo: Scaling AI inference with open-source efficiency

2 months ago
OpenAI and Musk agree to fast tracked trial over for-profit shift

OpenAI and Musk agree to fast tracked trial over for-profit shift

2 months ago
Oracle launches GenAI-based agents to fight financial crime

Oracle launches GenAI-based agents to fight financial crime

2 months ago
The role of Artificial Intelligence in personal finance: A game-changer for consumers

The role of Artificial Intelligence in personal finance: A game-changer for consumers

2 months ago
Trust meets efficiency: AI and blockchain mutuality

Trust meets efficiency: AI and blockchain mutuality

2 months ago
No Result
View All Result

Recent Posts

  • Bitcoin remains unmatched as a global inflation hedge
  • Elon Musk faces criticism over ‘white homicide’ claims in AI bots
  • US and Qatar sign defense and aviation deal as Trump doubles down on luxury aircraft gift
  • 5 most read worldwide: Credit defaults rise and gold steadies as global markets eye risk
  • Putin declines Istanbul peace talks; Zelenskyy to attend amid cautious optimism

Recent Comments

    Archives

    • May 2025
    • April 2025
    • March 2025
    • February 2025
    • January 2025
    • December 2024
    • November 2024
    • October 2024
    • September 2024
    • August 2024
    • July 2024
    • June 2024
    • May 2024
    • April 2024
    • March 2024
    • February 2024
    • January 2024
    • December 2023
    • November 2023
    • October 2023
    • September 2023
    • August 2023
    • July 2023
    • June 2023
    • May 2023
    • April 2023
    • March 2023
    • February 2023
    • January 2023
    • December 2022
    • November 2022
    • October 2022
    • September 2022
    • August 2022

    Categories

    • Africa
    • AI
    • An diesem Tag
    • Asia
    • Australia
    • Banking
    • Best chefs
    • Biden
    • Blockchain
    • Blockchain technology
    • Carbon
    • Central Banks
    • China
    • Climate
    • Climate & Energy
    • Coal
    • Cocktail of the week
    • Commodities
    • Corporate
    • Crypto
    • Deutsch
    • Deutsch PR
    • English PR
    • Europe
    • Financial insights
    • Focus on neobanking
    • Gas
    • Global news
    • Harris
    • History
    • India
    • Influential women
    • Invest and Rest
    • Italiano PR
    • Japan
    • Lifestyle
    • Metaverse
    • MSTRpay
    • Neobanking
    • News
    • newshub special
    • newshub-special
    • NFT
    • Nobel Prizes 2024
    • Nuclear
    • Oil
    • Press
    • Press releases
    • Pressroom
    • Renewable
    • Russia
    • Solar
    • South America
    • South East Asia
    • Stock of the week
    • Stocks
    • Svensk PR
    • Tech
    • Trump
    • Trump trials
    • UFO
    • UK
    • UK News
    • Ukraine
    • US
    • US politics
    • Waves
    • WEX
    • Wind

    Meta

    • Log in
    • Entries feed
    • Comments feed
    • WordPress.org

    Recent Posts

    • Bitcoin remains unmatched as a global inflation hedge
    • Elon Musk faces criticism over ‘white homicide’ claims in AI bots
    • US and Qatar sign defense and aviation deal as Trump doubles down on luxury aircraft gift
    • 5 most read worldwide: Credit defaults rise and gold steadies as global markets eye risk
    • Putin declines Istanbul peace talks; Zelenskyy to attend amid cautious optimism

    Categories

    • Africa
    • AI
    • An diesem Tag
    • Asia
    • Australia
    • Banking
    • Best chefs
    • Biden
    • Blockchain
    • Blockchain technology
    • Carbon
    • Central Banks
    • China
    • Climate
    • Climate & Energy
    • Coal
    • Cocktail of the week
    • Commodities
    • Corporate
    • Crypto
    • Deutsch
    • Deutsch PR
    • English PR
    • Europe
    • Financial insights
    • Focus on neobanking
    • Gas
    • Global news
    • Harris
    • History
    • India
    • Influential women
    • Invest and Rest
    • Italiano PR
    • Japan
    • Lifestyle
    • Metaverse
    • MSTRpay
    • Neobanking
    • News
    • newshub special
    • newshub-special
    • NFT
    • Nobel Prizes 2024
    • Nuclear
    • Oil
    • Press
    • Press releases
    • Pressroom
    • Renewable
    • Russia
    • Solar
    • South America
    • South East Asia
    • Stock of the week
    • Stocks
    • Svensk PR
    • Tech
    • Trump
    • Trump trials
    • UFO
    • UK
    • UK News
    • Ukraine
    • US
    • US politics
    • Waves
    • WEX
    • Wind

    Archives

    • May 2025
    • April 2025
    • March 2025
    • February 2025
    • January 2025
    • December 2024
    • November 2024
    • October 2024
    • September 2024
    • August 2024
    • July 2024
    • June 2024
    • May 2024
    • April 2024
    • March 2024
    • February 2024
    • January 2024
    • December 2023
    • November 2023
    • October 2023
    • September 2023
    • August 2023
    • July 2023
    • June 2023
    • May 2023
    • April 2023
    • March 2023
    • February 2023
    • January 2023
    • December 2022
    • November 2022
    • October 2022
    • September 2022
    • August 2022
    WE/X WE/X WE/X
    newshub

    © 2023-2025
    A part of MSTRpay
    MSTRpay
    Legal & Disclosure

    • Global news
    • Financial insights
    • Climate & energy
    • Lifestyle
    • WEX

    Welcome Back!

    Login to your account below

    Forgotten Password?

    Retrieve your password

    Please enter your username or email address to reset your password.

    Log In
    Please enter CoinGecko Free Api Key to get this plugin works.

    Add New Playlist

    No Result
    View All Result
    • Global news
    • Financial insights
      • Africa
      • Asia
      • Australia
      • Central Banks
      • China
      • Commodities
      • Corporate
      • Europe
      • Fintech
        • AI
        • Banking
        • Blockchain
        • Crypto
        • MSTRpay
        • Neobanking
      • Investment
      • Japan
      • South East Asia
      • Stock of the week
      • UK
      • US
    • Climate & energy
      • Climate
      • Carbon
      • Coal
      • Disruptive
      • Gas
      • Nuclear
      • Oil
      • Solar
      • Water
      • Waves
      • Wind
      • Renewable
      • South America
    • Lifestyle
      • Best chefs
      • Cocktail of the week
      • History
      • Influential women
    • WEX
      • Alt Kap Holding AB
      • Digital Network Holding, Inc.
      • Fantas-E AB
      • International Clean Energy Inc.
      • Intritum Partner Limited
      • Intritum Recycling GH Limited
      • MSTRpay AB
      • SWAP Services, Inc.
      • VMT Holding, Inc.
      • Universal Streaming Technologies – USTA
      • TC Unterhaltungselektronik AG

    © 2023-2025
    A part of MSTRpay
    MSTRpay
    Legal & Disclosure