What are datasets in ML?

Overview

In machine learning (ML), a dataset is a collection of examples used to train and evaluate models. Many high-quality datasets are freely available online for research and application purposes. They may contain text, images, or speech data.

Sources of datasets

We can access some of the public datasets from the sources listed below:

  1. Kaggle

    Kaggle allows users to explore and access various datasets in different formats.

  2. The Big Bad NLP Database

    This source primarily contains datasets for natural language processing (NLP) tasks.

  3. Google Dataset Search

    This works similarly to Google Scholar: it is a search engine that provides detailed information about over 25 million datasets.

Popular datasets

Some of the popular datasets used in applications of machine learning, deep learning, and data science are listed below:

  • MNIST dataset

    This is a dataset of handwritten digits containing 70,000 examples (60,000 for training and 10,000 for testing). We can use this dataset to learn image classification and simple pattern recognition.

    The dataset can be found here: http://yann.lecun.com/exdb/mnist/

  • Sentiment140

    This dataset contains tweets labeled by sentiment, so we can use it for sentiment analysis and other natural language processing tasks. It has 1,600,000 records with six fields each.

    The dataset can be found here: https://www.kaggle.com/datasets/kazanova/sentiment140

  • Credit card fraud detection

    This dataset contains 284,807 credit card transactions, each labeled as fraudulent or legitimate. We can use this dataset to build a model for detecting fraudulent activity.

    The dataset can be found here: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

  • Iris dataset

    This dataset contains sepal and petal length and width measurements of iris flowers. It includes three classes (species) with 50 entries each. We can use this dataset to learn pattern recognition.

    The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Iris
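
As a rough illustration of how a small tabular dataset such as Iris might be read and inspected, the sketch below parses rows in the comma-separated layout of the UCI file and counts the entries per class. The column names, the load_iris and class_counts helpers, and the short inline sample are assumptions made for this example; they are not part of the dataset itself.

```python
import csv
import io
from collections import Counter

# Assumed column names for readability; the UCI data file has no header row.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# A few illustrative rows in the same layout as the UCI file (not the full dataset).
SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def load_iris(text):
    """Parse Iris-style CSV text into a list of dicts with float measurements."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines, which csv.reader yields as []
            continue
        *measurements, species = record
        row = dict(zip(COLUMNS[:4], map(float, measurements)))
        row["species"] = species
        rows.append(row)
    return rows


def class_counts(rows):
    """Count how many rows belong to each species label."""
    return Counter(row["species"] for row in rows)


rows = load_iris(SAMPLE)
counts = class_counts(rows)
print(counts)
```

Passing the text of the full downloaded file to load_iris instead of SAMPLE should give a count of 50 for each of the three species, matching the description above.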

Copyright ©2024 Educative, Inc. All rights reserved