Project Structure

Learn to organize projects to ensure order and maintainability.

As a data scientist, it’s crucial to have a systematic approach when working on any project. Not only does being systematic make the project manageable, it also helps maintain code efficiency and makes collaboration easier. This lesson will discuss the best practices for project organization, including how to structure working directories, source files, and functions to ensure a clean, maintainable code base.

With a sound organizational system, we can easily find what we need and make changes or updates to the code. A good organizational system ensures our projects are easy to navigate, debug, and share with others. It also makes it easier to pick up where we’ve left off, especially when working on a project for an extended period.

Well-organized projects are easier to maintain

Organizing the working directory

The first step to structuring our projects is to ensure that we have a well-organized working directory. In data science, organizing our directory with subfolders for input data, outputs, source files, and a main script is often best. This structure makes it easy to find what we need and ensures that the project can be easily reproduced by anyone who may need to work on it.

  • Input data: The input data folder should contain all the data for a project. This includes any raw or cleaned data that’s contained in a file. If input data comes from a database or other live sources, it doesn’t need to be saved in this folder. Any data dictionaries or reference input files should also be in the input folder. Having all the data in one place ensures that we can easily find and access it whenever needed. It also makes it easy to back up and version data files, which is essential when working on a long-term project.

  • Outputs: The outputs folder should contain any outputs generated by our code. This includes any plots, tables, or other files the code produces. Having all the results in one place makes it easy to find, access, and share them. In data science, this can be especially helpful because the outputs folder often contains much of what needs to be shared with others.

  • Source files: The source files folder houses all of our reference scripts, which is generally where we define any custom functions.

  • Main: The main script will be the central location of the project. It’ll be responsible for loading data, making function calls, referencing our source files, and generating our outputs.
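
The layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lesson doesn't prescribe a language or exact names, so the folder names input, output, and src, the file name main.py, and the helper create_project_skeleton are all hypothetical choices.

```python
from pathlib import Path
import tempfile

# Hypothetical names: the lesson describes the roles of these folders,
# not what they must be called.
SUBFOLDERS = ("input", "output", "src")
MAIN_SCRIPT = "main.py"


def create_project_skeleton(root):
    """Create the three subfolders and an empty main script under root.

    Returns the sorted names of the entries created at the top level.
    """
    root = Path(root)
    for name in SUBFOLDERS:
        # parents=True also creates root itself if it doesn't exist yet;
        # exist_ok=True makes the function safe to run more than once.
        (root / name).mkdir(parents=True, exist_ok=True)
    (root / MAIN_SCRIPT).touch(exist_ok=True)
    return sorted(p.name for p in root.iterdir())


if __name__ == "__main__":
    # Build the skeleton in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        print(create_project_skeleton(Path(tmp) / "my_project"))
```

Because the folders are created with exist_ok=True, running the helper against an existing project leaves any files already in place untouched.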

Source files

The source files ...