The Best 50 Free Datasets for Machine Learning

TagX has compiled a list of data sources for Machine Learning (ML) and Natural Language Processing (NLP). In our previous articles, we explained why datasets are such a crucial part of both fields. Without training datasets, machine learning algorithms would have no way of learning how to mine text, classify text, or categorize products.

This article is a curated list of open datasets for machine learning. They range from the very large (looking at you, Kaggle) to the highly specific, such as finance or Amazon product datasets.

First, a few pointers to keep in mind when searching for datasets:

Look for clean datasets, because you don't want to spend your own time cleaning up the data.
Look for datasets without too many rows and columns, since those are easier to work with.
Look for a dataset that lets you pose and answer an interesting question.
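
One way to apply these pointers before committing to a dataset is to profile the file quickly. The sketch below is a minimal illustration using only Python's standard library. It assumes the dataset is a CSV file with a header row; the function name `profile_csv` and the sample rows are hypothetical and do not come from any dataset listed here.

```python
import csv
import io


def profile_csv(handle):
    """Return row count, column count, and empty-cell counts per column."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            # Treat absent or whitespace-only cells as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return {"rows": rows, "columns": len(columns), "missing": missing}


# Hypothetical sample standing in for a downloaded dataset.
sample = io.StringIO(
    "brand,style,stars\n"
    "Acme,Cup,3.5\n"
    "Noodle Co,,4.0\n"
    "Bowlful,Pack,\n"
)
report = profile_csv(sample)
print(report)
```

A small row and column count with few missing cells suggests the dataset will be quick to get started with. For a real file, pass an open file handle instead of the in-memory sample.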

Open Dataset Finders

Where can I download free, open machine learning datasets?

Working on a variety of projects is the best way to practice machine learning. With these large dataset finders, you can search for and download free datasets online.

Kaggle: A data science platform that hosts a variety of interesting, externally contributed datasets. Its master list contains all sorts of niche datasets, from ramen ratings to basketball data to Seattle pet licenses.

UCI Machine Learning Repository: One of the oldest sources of datasets on the web, and a great first stop when looking for interesting datasets. Although the datasets are user-contributed and therefore vary in cleanliness, the vast majority are clean. You can download data directly from the UCI Machine Learning Repository, without registration.

Government Datasets for Machine Learning

Where can I download Machine Learning Public Government Datasets?

Demographic data underpins major economic decisions, which makes it an important instrument for transforming government and society. Machine learning models trained on public government data can help policymakers identify trends and prepare for issues related to population decline or growth, aging, and migration.

Data.gov: This site provides access to data from multiple US government agencies. Data ranges from government budgets to school performance scores. Be warned, though: much of the data requires additional research.

EU Open Data Portal: The EU Open Data Portal offers access to open data released by EU institutions in areas as varied as finance, jobs, research, and the climate.

School System Finance: A survey of the finances of school systems in the US.

US Healthcare Data: Data on population health, diseases, drugs, and health plans, compiled from the FDA drug database and the USDA Food Composition Database.

The U.S. National Center for Education Statistics: This site hosts data from the U.S. and around the world on educational institutions and education demographics.

The UK Data Service: The UK's largest collection of social, economic and population data.

Data USA: A comprehensive visualization of US public data.

Machine Learning Datasets for Finance & Economics

Where can I download datasets for finance and economics for machine learning?

Machine learning is proving to be a golden opportunity for the financial sector. Quantitative financial records have been kept for decades, so the industry is ideally suited to machine learning. Indeed, machine learning is already transforming finance and investment banking through algorithmic trading, stock market prediction, and fraud detection.

Quandl: A good source of economic and financial data, useful for building models that predict economic indicators or stock prices.

World Bank Open Data: Datasets covering population demographics and a large range of global economic and development indicators.

IMF Data: The International Monetary Fund publishes data on international finances, debt rates, foreign exchange reserves, commodity prices, and investments.

Financial Times Market Data: Up-to-date information on financial markets from around the world, including stock price indices, commodities, and foreign exchange.

Google Trends: Examine and analyze data on internet search activity and search trends.

American Economic Association (AEA): A good source to find US macroeconomic data.

Datasets for Computer Vision

Where can I download Computer Vision Image Datasets?

Image datasets are useful for training a wide range of computer vision applications, such as medical imaging technology, autonomous vehicles, and facial recognition.

Labelme: A large annotated image dataset.

ImageNet: The de facto image dataset for new algorithms. It is organized according to the WordNet hierarchy, in which each node is depicted by hundreds or thousands of images.

LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.).

MS COCO: Generic image understanding and captioning.

COIL100: 100 different objects imaged at every angle in a 360-degree rotation.

Visual Genome: Visual knowledge base with very detailed captioning of ~100K images.

Google's Open Images: A collection of 9 million URLs to images that have been annotated with labels spanning over 6,000 categories, under Creative Commons.

Labeled Faces in the Wild: 13,000 labeled images of human faces, for use in developing facial recognition applications.

Stanford Dogs Dataset: Contains 20,580 images covering 120 different dog breeds.

Indoor Scene Recognition: A very specific dataset, useful because most scene recognition models perform better on outdoor scenes. It contains 67 indoor categories and a total of 15,620 images.

VisualQA: This dataset contains open-ended questions about 265,016 images. Answering the questions requires an understanding of both vision and language.
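
To make the idea of a hierarchy-organized image dataset more concrete, such as the WordNet-style structure mentioned for ImageNet above, here is a minimal sketch. The category names, file names, and the `images_under` helper are all hypothetical; this is not ImageNet's actual file layout or API, only an illustration of how images attached to nodes can be gathered for a whole subtree.

```python
# Hypothetical miniature hierarchy: parent -> children.
children = {
    "animal": ["dog", "cat"],
    "dog": ["terrier", "retriever"],
    "cat": [],
    "terrier": [],
    "retriever": [],
}

# Hypothetical image files attached directly to each node.
images = {
    "animal": [],
    "dog": ["dog_001.jpg"],
    "cat": ["cat_001.jpg", "cat_002.jpg"],
    "terrier": ["terrier_001.jpg"],
    "retriever": ["retriever_001.jpg", "retriever_002.jpg"],
}


def images_under(node):
    """Collect images for a node and all of its descendants."""
    found = list(images.get(node, []))
    for child in children.get(node, []):
        found.extend(images_under(child))
    return found


dog_images = images_under("dog")
print(len(dog_images))
```

Training a classifier for a broad category such as "dog" would then draw on the images of every breed beneath it, which is one reason a hierarchical organization is convenient.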

Sentiment Analysis Datasets for Machine Learning

Where can I download datasets for machine learning with sentiment analysis?

To learn effectively, sentiment analysis models require large, specialized datasets. The following list should hint at some of the many ways you can improve your sentiment analysis algorithm.

Multidomain Sentiment Analysis Dataset: A slightly older dataset that features Amazon product reviews.

IMDB Reviews: An older, relatively small dataset for binary sentiment classification, featuring 25,000 movie reviews.

Stanford Sentiment Treebank: A standard sentiment dataset with sentiment annotations.

Sentiment140: A popular dataset of 1.6 million tweets with emoticons pre-removed.

Twitter US Airline Sentiment: Twitter data on US airlines from February 2015, classified as positive, negative, and neutral tweets.

Datasets on Natural Language Processing

Where can I download open natural language processing datasets?

Natural language processing is a huge area of study, but the following list covers a wide variety of datasets for different NLP tasks, such as speech recognition and chatbots.

Enron Dataset: Email data from the senior management of Enron, organized into folders.

Amazon Reviews: Contains nearly 35 million Amazon reviews spanning 18 years. Data includes product and user information, ratings, and the plaintext review.

Google Books Ngrams: A collection of words from Google Books.

Blogger Corpus: A collection of 681,288 blog posts gathered from blogger.com. Each blog contains a minimum of 200 occurrences of commonly used English words.

Wikipedia Links Data: The full text of Wikipedia. The dataset contains almost 1.9 billion words from more than 4 million articles. You can search by word, phrase, or part of a paragraph.

Gutenberg eBooks List: An annotated list of ebooks from Project Gutenberg.

Hansard Text Chunks of the Canadian Parliament: 1.3 million pairs of texts from the records of the 36th Canadian Parliament.

Jeopardy: Archive of over 200,000 questions from the Jeopardy quiz show.

SMS Spam Collection in English: A dataset of 5,574 English SMS messages, each labeled as spam or legitimate.

Yelp Reviews: An open dataset released by Yelp, containing more than 5 million reviews.

UCI's Spambase: A large spam email dataset, useful for spam filtering.
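
Several of the datasets above, such as the spam collections, pair a label with a piece of text. As a rough sketch of a first step with data of that kind, the example below parses labeled lines and counts each label to check class balance. It assumes a tab-separated layout with the label first, and the `parse_labeled_lines` helper and sample messages are hypothetical; check the documentation of the dataset you download for its actual format.

```python
from collections import Counter


def parse_labeled_lines(lines):
    """Split 'label<TAB>message' lines into (label, message) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _, message = line.partition("\t")
        pairs.append((label, message))
    return pairs


# Hypothetical messages in the assumed format.
sample_lines = [
    "ham\tAre we still meeting for lunch today?\n",
    "spam\tYou have won a prize! Reply now to claim.\n",
    "ham\tRunning late, see you in ten minutes.\n",
]

pairs = parse_labeled_lines(sample_lines)
label_counts = Counter(label for label, _ in pairs)
print(label_counts)
```

Checking the label counts early is useful because a heavily imbalanced dataset can make a spam filter look accurate while it simply predicts the majority class.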

Datasets for Autonomous Vehicles

Where can I download open datasets for autonomous vehicle training?

Autonomous vehicles need large amounts of high-quality training data in order to perceive their environment and surrounding objects accurately.

Berkeley DeepDrive BDD100k: Currently the largest dataset for self-driving AI. Contains more than 100,000 videos covering over 1,100 hours of driving at different times of day and in different weather conditions. The annotated images come from the New York and San Francisco areas.

Baidu Apolloscapes: A large image dataset that defines 26 distinct semantic classes such as vehicles, motorcycles, pedestrians, buildings, and street lights.

Comma.ai: More than 7 hours of highway driving. The data includes the car's speed, acceleration, steering angle, and GPS coordinates.

Oxford's Robotic Car: Over 100 repetitions of the same route through Oxford, UK, captured over a period of one year. The dataset records various combinations of weather, traffic and pedestrians, along with long-term changes such as construction and roadworks.

Cityscapes Dataset: A large dataset that records urban street scenes in 50 different cities.

KUL Belgium Traffic Sign Dataset: More than 10,000 traffic sign annotations from thousands of physically distinct traffic signs in the Flanders region of Belgium.

MIT AgeLab: A sample of the 1,000+ hours of multi-sensor driving datasets collected at AgeLab.

LISA (Laboratory for Intelligent & Safe Automobiles, UC San Diego): This dataset includes traffic signs, vehicle detection, traffic lights, and trajectory patterns.

Still struggling to find what you need? TagX develops extensive, reliable datasets for machine learning projects. With highly trained linguists working across languages, we are well placed to create the custom dataset you have been looking for.