Project Approach

This section describes the approach taken in developing the 5D Neural Network Interpolator application. The sections have been numbered to correspond to the coursework questions.

Question 3: Data Handling

A separate module was created to handle data loading and preprocessing, located in backend/interpylate_fls/data.py. The DataLoader class provides methods for loading datasets from pickle files, validating and preprocessing data, splitting datasets into training, validation, and test sets, and standardizing features.

The load_dataset method loads the dataset from a pickle file, while the inspect_data method validates the dataset structure and performs preprocessing. The split_data method divides the dataset into training, validation, and test sets, and the standardize_data method standardizes the features using StandardScaler.
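
The split-then-standardize flow described above can be sketched as follows. This is a minimal illustration rather than the package's code: the free-function form, the 70/15/15 split, and the plain-NumPy scaling (standing in for scikit-learn's StandardScaler) are all assumptions. The point it shows is that the scaling statistics are fitted on the training split only and then reused for the other splits.

```python
import numpy as np


def split_data(X, y, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then cut into train/validation/test partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = round(len(X) * test_frac)
    n_val = round(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (
        (X[train_idx], y[train_idx]),
        (X[val_idx], y[val_idx]),
        (X[test_idx], y[test_idx]),
    )


def standardize(train_X, *others):
    """Fit mean/std on the training features only, then apply to every split."""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return tuple((part - mean) / std for part in (train_X, *others))


# Hypothetical 5-feature dataset used only to exercise the functions.
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(200, 5))
y = X.sum(axis=1)

(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_data(X, y)
train_s, val_s, test_s = standardize(train_X, val_X, test_X)
```

Fitting the scaler on the training split alone keeps information from the validation and test sets out of the preprocessing step.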

To handle missing values, the implementation replaces NaN values with column means. This approach assumes that only a small number of values are missing and that no analytical or statistical relationship between the input features (x terms) and the target variable (y) is known that could inform a better estimate. Mean imputation is a common and simple choice in this situation because it keeps every row. The alternative, removing rows with missing values, would reduce the dataset size and could lead to information loss.
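
Column-mean imputation can be illustrated with a short sketch. The function name and the use of a plain NumPy array are assumptions made for the example; the package itself may operate on a pandas DataFrame instead.

```python
import numpy as np


def fill_nan_with_column_means(data):
    """Return a copy of `data` where each NaN is replaced by its column mean."""
    data = np.asarray(data, dtype=float)
    col_means = np.nanmean(data, axis=0)  # per-column mean, ignoring NaN entries
    return np.where(np.isnan(data), col_means, data)


# Tiny hypothetical example with one missing value per column.
raw = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
clean = fill_nan_with_column_means(raw)
```

With a pandas DataFrame, `df.fillna(df.mean())` has the same effect for numeric columns.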

Question 4: Neural Network

A separate module was created to handle neural network training and evaluation, located in backend/interpylate_fls/neuralnetwork.py. The NeuralNetwork class provides methods for training, evaluation, and prediction. The train method trains the network on the training set, the evaluate method evaluates performance on the test set using MSE and R² metrics, and the predict method makes predictions for individual inputs.
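
For reference, the two evaluation metrics can be written out in a few lines. This is a dependency-free sketch of the standard definitions, not the code used by the evaluate method; scikit-learn's `sklearn.metrics` module provides functions with the same names.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
mse_value = mean_squared_error(y_true, y_pred)
r2_value = r2_score(y_true, y_pred)
```

An R² of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean.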

A lightweight PyTorch neural network was chosen, allowing users to customise the number of hidden layers (1 to 5), neurons per layer (16 to 64), learning rate (0.001 to 0.01), and number of epochs (5 to 200). The network uses ReLU activation, an MSE loss function, the Adam optimizer, and a fixed random seed of 42 for reproducibility. The model also includes a dropout layer, and early stopping is enabled during training.
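
A network of this kind could be assembled roughly as shown below. This is a sketch under stated assumptions, not the NeuralNetwork class itself: the function names, the position of the dropout layer, the dropout rate, full-batch training, and the patience value are all hypothetical choices made for illustration.

```python
import torch
from torch import nn


def build_mlp(n_inputs=5, hidden_layers=2, neurons=32, dropout=0.1):
    """Stack Linear+ReLU blocks, one dropout layer, and a scalar output layer."""
    layers = []
    width = n_inputs
    for _ in range(hidden_layers):
        layers += [nn.Linear(width, neurons), nn.ReLU()]
        width = neurons
    layers += [nn.Dropout(dropout), nn.Linear(width, 1)]  # dropout placement is assumed
    return nn.Sequential(*layers)


def train(model, X_train, y_train, X_val, y_val, lr=0.001, epochs=50, patience=5):
    """Full-batch training with Adam and MSE loss; stops when validation loss stalls."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    best_val, stale, history = float("inf"), 0, []
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        optimizer.step()

        model.eval()  # disables dropout for the validation pass
        with torch.no_grad():
            val_loss = loss_fn(model(X_val), y_val).item()
        history.append(val_loss)

        if val_loss < best_val:
            best_val, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stopping: no improvement for `patience` epochs
    return history


torch.manual_seed(42)  # fixed seed, as in the text
X = torch.randn(120, 5)  # hypothetical data
y = X.sum(dim=1, keepdim=True)
model = build_mlp(hidden_layers=2, neurons=32)
history = train(model, X[:100], y[:100], X[100:], y[100:], epochs=20)
```

The returned list of validation losses is the kind of record a learning-curve plot would be drawn from.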

This lightweight design keeps the application simple and easy to use while enabling efficient handling of large datasets. The model can be saved and loaded using the save_model and load_model methods, with error handling to ensure correct serialization. As an extension, a larger network with additional customisation options could be implemented, such as allowing users to choose the activation function, loss function, optimizer, and random seed.

The network has purposefully been designed to be compatible with the other Python libraries used in the project (NumPy, Pandas, Scikit-learn) by keeping all tensors on the CPU, as these libraries cannot operate directly on GPU tensors.

Note: For this particular task, the neural network was kept lightweight for demonstrative purposes.

Question 5: Backend

A comprehensive backend was created to handle API endpoints and data management, located in backend/main.py. The backend uses FastAPI to provide RESTful endpoints for uploading datasets, training models, making predictions, and managing application state.

The ModelState class manages application state, providing methods to set and retrieve data and models. The set_data and get_data methods handle dataset storage, while set_model and get_model manage the trained model. The clear method resets the application state, and has_data and has_model methods check for data and model availability.
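
The interface described above could look roughly like the following. The method names come from the description; the private attribute names and the in-memory storage are assumptions, so this should be read as a sketch rather than the class in backend/main.py.

```python
from typing import Any, Optional


class ModelState:
    """In-memory holder for the current dataset and trained model."""

    def __init__(self) -> None:
        self._data: Optional[Any] = None
        self._model: Optional[Any] = None

    def set_data(self, data: Any) -> None:
        self._data = data

    def get_data(self) -> Optional[Any]:
        return self._data

    def set_model(self, model: Any) -> None:
        self._model = model

    def get_model(self) -> Optional[Any]:
        return self._model

    def has_data(self) -> bool:
        return self._data is not None

    def has_model(self) -> bool:
        return self._model is not None

    def clear(self) -> None:
        """Reset to the initial empty state."""
        self._data = None
        self._model = None


state = ModelState()
state.set_data({"rows": 100})  # placeholder for a real dataset
had_data = state.has_data()
state.clear()
```

Keeping the checks in has_data and has_model lets an endpoint reject a training or prediction request early when the prerequisite step has not been completed.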

Two directories were created in the backend: the uploads folder stores uploaded datasets, and the outputs folder stores the trained neural network and generated plots. This structure allows the application to retrieve information about these files and display them on the frontend. Additional configuration clears these files when the user selects the reset button on the frontend.
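
Clearing the stored files on reset might be implemented along these lines. The helper name is hypothetical, and the example runs against a temporary directory so that no real files are touched.

```python
import tempfile
from pathlib import Path


def clear_directory(folder: Path) -> int:
    """Delete every regular file directly inside `folder`; return how many were removed."""
    removed = 0
    for path in folder.iterdir():
        if path.is_file():
            path.unlink()
            removed += 1
    return removed


# Demonstrated on a throwaway directory standing in for the uploads folder.
with tempfile.TemporaryDirectory() as tmp:
    uploads = Path(tmp) / "uploads"
    uploads.mkdir()
    (uploads / "dataset.pkl").write_bytes(b"placeholder")
    removed = clear_directory(uploads)
    remaining = list(uploads.iterdir())
```

The folder itself is left in place, so later uploads do not need to recreate it.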

The backend Python package uses a logging module, located in backend/interpylate_fls/logger.py, to track application progress and errors. Additionally, a plotter module, located in backend/interpylate_fls/plotter.py, generates the learning curve and predictions vs actual values plots. The plotter uses static methods to organise the code and improve readability.
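
A shared logger of this kind is often set up with a small helper built on the standard library. The function name, logger name, and message format below are assumptions and may differ from what logger.py actually does.

```python
import logging


def get_logger(name: str = "interpylate_fls") -> logging.Logger:
    """Return a named logger, attaching a console handler only once."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


log = get_logger()
log.info("Dataset loaded")
```

Because `logging.getLogger` returns the same object for the same name, every module that calls the helper writes through one consistently formatted logger.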

As an extension task, the interpylate_fls package has been uploaded to PyPI and can be installed using the command pip install interpylate-fls.

Question 6: Frontend

The frontend user interface was created using Next.js and Tailwind CSS, located in the frontend/ folder. The interface was intentionally kept simple and intuitive so that the project focus remained on end-to-end functionality and software development best practices rather than frontend complexity.

It was important to include status updates on the frontend so that users can confirm the backend is running and the application is progressing as expected. Status indicators show when the uploaded dataset is being processed and when the model is being trained.

To improve user experience, sections 2 and 3 are initially hidden and only revealed when the corresponding steps (train and predict) become available. This prevents overwhelming users with information and keeps the interface clean and easy to use.

Dataset statistics are presented at a high level while still providing the information users need to understand their uploaded dataset. When configuring the model, each customisable hyperparameter is restricted to a bounded range, which prevents users from breaking the application by entering invalid or extremely large values.
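
The idea of bounding hyperparameters can be sketched as a simple clamping step. The example is written in Python for consistency with the backend code and is not taken from the frontend; the ranges are those stated in the Neural Network section, while the dictionary keys and function name are hypothetical.

```python
# Allowed (minimum, maximum) for each hyperparameter.
BOUNDS = {
    "hidden_layers": (1, 5),
    "neurons": (16, 64),
    "learning_rate": (0.001, 0.01),
    "epochs": (5, 200),
}


def clamp_hyperparameters(requested: dict) -> dict:
    """Clip each known hyperparameter into its allowed range; ignore unknown keys."""
    clamped = {}
    for name, (low, high) in BOUNDS.items():
        if name in requested:
            clamped[name] = min(max(requested[name], low), high)
    return clamped


safe = clamp_hyperparameters(
    {"hidden_layers": 50, "neurons": 8, "learning_rate": 0.005, "epochs": 10_000}
)
```

Values already inside their range pass through unchanged, while out-of-range values are pulled back to the nearest bound.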

The learning curve plot and predictions vs actual values plot are available on the frontend to help users understand model performance and adjust hyperparameters accordingly. These images can be viewed optionally to keep the UI simple for non-technical users. The prediction section features sliders for inputting feature values and a button to make predictions, with results displayed in a card.

The frontend is designed to be responsive and easy to use on both desktop and mobile devices, improving accessibility. Finally, a link to the GitHub repository is provided to allow users to view the source code and contribute to the project, promoting open-source collaboration.

Question 7: Testing and Reproducibility

A comprehensive testing suite was created to validate backend and frontend functionality, located in the backend/tests/ folder. The testing suite is organised into separate files for clarity. A conftest.py file was created to provide shared test data and fixtures for the testing suite, ensuring consistent test setup across all test files.

Only temporary files are created during testing, and these are removed once the tests complete so that they cannot interfere with the main application. The application state is also cleared before each test so that tests remain isolated and do not affect one another.
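
The isolation pattern can be illustrated with a standard-library context manager. This is a sketch of the idea only: the helper name is hypothetical, a plain dictionary stands in for the application state, and the real suite uses fixtures defined in conftest.py.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def isolated_workspace(state: dict):
    """Yield a temporary directory and reset `state` before and after use."""
    state.clear()  # start every test from a clean state
    with tempfile.TemporaryDirectory() as tmp:
        try:
            yield tmp
        finally:
            state.clear()  # leave nothing behind for the next test
    # The temporary directory and its contents are deleted on exit.


app_state = {"model": "stale"}
with isolated_workspace(app_state) as workdir:
    state_at_start = dict(app_state)
    path = os.path.join(workdir, "dataset.pkl")
    with open(path, "wb") as fh:
        fh.write(b"test")
    app_state["model"] = "trained"
    existed_during = os.path.exists(path)
```

In a pytest suite the same effect is usually obtained with a fixture, for example by combining the built-in `tmp_path` fixture with a state reset.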

Question 8: Benchmarking

A benchmarking script was created to evaluate neural network performance across different dataset sizes, located in backend/experiments/performance_benchmark.py. The script covers three key aspects of performance: training time as a function of dataset size, memory usage during training and prediction, and accuracy metrics (MSE and R²) across different dataset sizes.
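
Timing and memory measurements of this kind can be collected with the standard library, as in the sketch below. The helper and the stand-in workload are hypothetical; the actual script may use different tools.

```python
import time
import tracemalloc


def benchmark(fn, *args):
    """Return wall-clock seconds and peak traced memory (bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": elapsed, "peak_bytes": peak, "result": result}


def fake_training(n_samples):
    # Stand-in workload; a real benchmark would train the network here.
    return sum(i * i for i in range(n_samples))


# One measurement per dataset size.
results = {n: benchmark(fake_training, n) for n in (1_000, 10_000, 100_000)}
```

Note that `tracemalloc` only tracks allocations made through Python's memory allocator, so memory used inside native libraries may need a separate measurement.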

Following guidance from the C1 class, synthetic datasets of varying sizes were created to test the neural network’s performance. A variety of experiments were run using these datasets, with results saved to the backend/experiments/results/ folder and plots generated in the backend/experiments/figures/ folder.
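
Synthetic 5D datasets of different sizes could be generated as follows. The target function, noise level, and dataset sizes are invented for illustration and are not the ones used in the experiments.

```python
import numpy as np


def make_synthetic_dataset(n_samples, noise=0.05, seed=42):
    """Generate 5 uniform features and a noisy smooth target."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_samples, 5))
    # Hypothetical smooth target chosen only for illustration.
    y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + X[:, 3] ** 2 - X[:, 4]
    y = y + rng.normal(0.0, noise, size=n_samples)
    return X, y


datasets = {n: make_synthetic_dataset(n) for n in (1_000, 5_000, 10_000)}
```

Seeding the generator makes each dataset reproducible, so repeated benchmark runs operate on identical data.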

The full findings from the benchmarking script are available in the Performance and Profiling section of the documentation.

Best Practices

Type Hints

Type hints have been applied throughout the backend codebase to improve code clarity, enable better IDE support, and facilitate static type checking. All functions and methods in the backend package include type annotations for parameters and return values.

In the interpylate_fls package, type hints are used consistently across all modules. For example:

DataLoader (data.py): Functions are annotated with types such as str for file paths, pd.DataFrame for return types, and Tuple[np.ndarray, ...] for complex return values.

Docstring

Consistent docstring formatting has been applied throughout the codebase following a structured approach:

For Classes:

All classes include docstrings with three sections:

Purpose: A clear description of what the class does and its role in the application.
Attributes: A list of all instance attributes with their types and descriptions.
Methods: A list of all public methods with brief descriptions.

Example from the DataLoader class:

class DataLoader:
    """Utility class for loading and preprocessing 5-dimensional datasets.

    Purpose:
        This class handles loading pickle files containing 5D datasets, validating
        data structure, handling missing values, splitting data into train/validation/test
        sets, and standardizing features for neural network training.

    Attributes:
        PATH (str): Path to the pickle file containing the dataset
        scaler (sklearn.preprocessing.StandardScaler): StandardScaler instance for feature normalization

    Methods:
        __init__: Initialize the DataLoader with a file path
        load_dataset: Load a dataset from a pickle file
        inspect_data: Validate, preprocess, and split the dataset
    """

For Functions:

All functions include docstrings with two sections:

Parameters: Each parameter is listed with its type and a description.
Returns: The return type and a description of what is returned.

Example from the load_dataset method:

def load_dataset(self) -> pd.DataFrame:
    """Load a dataset from a pickle file.

    Parameters:
        None (uses self.PATH)

    Returns:
        pandas.DataFrame: The loaded dataset from the pickle file.
    """

Exception Handling

Exception handling has been implemented throughout the backend so that failures are caught and reported to users with meaningful error messages.

Example 1: Data Loading and Validation:

The DataLoader class includes validation checks that raise appropriate exceptions:

KeyError: Raised when required feature columns (x1-x5) or target column (y) are missing from the dataset.
ValueError: Raised when data shapes are incorrect or do not match expected dimensions.
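
Checks of this kind might look like the sketch below. The function names and error messages are hypothetical, and plain lists stand in for the DataFrame and arrays that the DataLoader works with.

```python
FEATURES = ["x1", "x2", "x3", "x4", "x5"]
TARGET = "y"


def validate_columns(columns):
    """Raise KeyError if any required feature or target column is absent."""
    missing = [c for c in FEATURES + [TARGET] if c not in columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")


def validate_shapes(X, y):
    """Raise ValueError if rows are not 5-dimensional or counts disagree."""
    if any(len(row) != len(FEATURES) for row in X):
        raise ValueError("Each input row must contain exactly 5 feature values")
    if len(X) != len(y):
        raise ValueError(f"Got {len(X)} input rows but {len(y)} target values")


# Exercise both checks with deliberately invalid inputs.
errors = []
for check, args in [
    (validate_columns, (["x1", "x2", "y"],)),
    (validate_shapes, ([[1, 2, 3]], [0.5])),
]:
    try:
        check(*args)
    except (KeyError, ValueError) as exc:
        errors.append(type(exc).__name__)
```

Raising distinct exception types lets the calling code tell a structural problem (missing columns) apart from a dimensional one (wrong shape).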

Example 2: API Endpoint Error Handling:

Training endpoint: Handles training failures, model saving errors, and plot generation errors, ensuring partial failures don’t crash the application.

Example 3: Error Response Format:

The API endpoints return consistent error response structures. For example:

{
    "status": "error",
    "message": "Descriptive error message explaining what went wrong"
}
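
One way to produce this structure while tolerating partial failures is sketched below. The error payload follows the format shown above; the function names, the shape of the success response, and the "warnings" key are assumptions, and the framework-specific endpoint code is omitted.

```python
def error_response(message: str) -> dict:
    """Build the error payload in the documented format."""
    return {"status": "error", "message": message}


def run_training_pipeline(train, save_model, make_plots) -> dict:
    """Treat a training failure as fatal; turn saving/plotting failures into warnings."""
    try:
        metrics = train()
    except Exception as exc:
        return error_response(f"Training failed: {exc}")

    warnings = []
    for label, step in (("model saving", save_model), ("plot generation", make_plots)):
        try:
            step()
        except Exception as exc:
            warnings.append(f"{label} failed: {exc}")
    return {"status": "success", "metrics": metrics, "warnings": warnings}


def failing_plot():
    raise RuntimeError("no display backend")


# Training succeeds but plotting fails: the request still completes.
partial = run_training_pipeline(lambda: {"mse": 0.1}, lambda: None, failing_plot)
# Training itself fails: an error payload is returned instead.
failed = run_training_pipeline(lambda: 1 / 0, lambda: None, lambda: None)
```

Separating the fatal step from the optional ones means a plotting problem does not discard an otherwise successful training run.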

Class-Based Design

The Python package uses classes to organise functionality and encapsulate related methods and data. This design has several benefits:

Encapsulation and State Management:

Classes allow related functionality and state to be grouped together. For example, the DataLoader class encapsulates the dataset path and scaler instance, ensuring that data preprocessing operations are performed consistently using the same scaler that was fitted on the training data. Similarly, the NeuralNetwork class maintains the model architecture, training history, and data tensors as instance attributes, allowing methods to access shared state without passing parameters repeatedly.

Code Organisation and Modularity:

Each class represents a distinct component of the application:

DataLoader: Handles all data loading and preprocessing operations
NeuralNetwork: Manages model architecture, training, and evaluation
Logger: Provides consistent logging functionality across the application
Plotter: Contains static methods for generating visualisations
ModelState: Manages application state in the backend API

General Notes

Shell scripts, located in the scripts/ folder, were created to streamline common tasks. The build_docs.sh script builds the Sphinx documentation, and the launch.sh script launches the entire application locally. These scripts simplify the development workflow and make it easier for users to get started with the project.