MLOps - Data Engineering

2022/03/22

Data Engineering

Data Ingestion
- Includes synthetic data generation or data enrichment
- Details
  - Data Source Identification
  - Space Estimation
  - Space Location
  - Obtaining Data
  - Backup data
  - Privacy Compliance
  - Metadata catalog
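
The space estimation step can be roughed out with simple arithmetic. The sketch below is only an illustration: the function name `estimate_storage_bytes` and the `copies` and `headroom` parameters are assumptions made for this example, not something the note prescribes.

```python
def estimate_storage_bytes(num_records, avg_record_bytes, copies=2, headroom=0.2):
    """Rough storage estimate: raw size times number of copies, plus headroom.

    copies   -- hypothetical: primary copy plus backups (2 means one backup)
    headroom -- hypothetical: extra fraction reserved for growth
    """
    raw = num_records * avg_record_bytes
    return round(raw * copies * (1 + headroom))


if __name__ == "__main__":
    # Example figures are made up: 10 million records of about 500 bytes each.
    total = estimate_storage_bytes(10_000_000, 500)
    print(f"{total / 1e9:.1f} GB")
```

Counting backups in the same estimate ties "Space Estimation" to the "Backup data" item, so both are sized together.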
Exploration and Validation
- Data Profiling: Schema, data types, metadata (max, min, avg), user-defined error detection
- Details
  - Use RAD tools (Jupyter notebooks) to keep records of data exploration and experimentation
  - Attribute profiling
    - Name
    - Number of records
    - Data type (categorical, numerical, int/float, text, structured, etc.)
    - Numerical Measures (min, max, avg, median)
    - Amount of missing values (or "missing value ratio" = number of absent values / number of records)
    - Type of distribution (Gaussian, uniform, logarithmic)
  - Label attribute identification
  - Data Visualization (Build a visual representation for value distribution)
  - Attributes Correlation: Compute and analyze the correlations between attributes
  - Additional data: Identify data that would be useful for building the model (go back to "data ingestion")
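
The attribute profiling and correlation items above can be sketched with the Python standard library alone. This is a minimal illustration, assuming missing values are represented as `None`; the helper names `profile_attribute` and `pearson` are hypothetical, and in a notebook one would more likely reach for a dataframe library.

```python
import math
import statistics


def profile_attribute(name, values):
    """Profile one attribute; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    profile = {
        "name": name,
        "records": len(values),
        # missing value ratio = number of absent values / number of records
        "missing_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        profile.update(
            dtype="numerical",
            min=min(present),
            max=max(present),
            avg=statistics.mean(present),
            median=statistics.median(present),
        )
    else:
        profile["dtype"] = "categorical/text"
    return profile


def pearson(xs, ys):
    """Pearson correlation over rows where both attributes are present.

    Assumes at least two complete rows and that neither attribute is constant.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = statistics.mean([x for x, _ in pairs])
    my = statistics.mean([y for _, y in pairs])
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    age = [25, 32, None, 41]
    income = [30000, 42000, 50000, None]
    print(profile_attribute("age", age))
    print(pearson(age, income))
```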
Data Wrangling (Cleaning)
- Reformatting attributes and correcting errors in the data, e.g. by imputing missing values
- Details:
  - Data Imputations
    - What is imputation
    - How to impute values
      - Mean/Median/Mode
      - Hot-Deck
      - Model-based
      - Proper Multiple Stochastic Regression
      - Pattern Submodel Approach
      - ...
  - Transformations
  - Outliers: Fix or remove outliers
  - Missing values: Fill in missing values (e.g., with zero, mean, median) or drop their rows or columns
  - Irrelevant data: Drop attributes that provide no useful information for the task (relevant for feature engineering)
  - Restructure data: Might include the following operations
    - Reordering record fields by moving columns
    - Creating new record fields through extracting values
    - Combining multiple record fields into a single record field
    - Filtering datasets by removing sets of records
    - Shifting the granularity of the dataset and the fields associated with records through aggregations and pivots
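
Three of the wrangling steps above (filling missing values, removing outliers, and shifting granularity through aggregation) can be sketched as follows. This is a rough illustration using only the standard library. The function names are hypothetical, `None` is assumed to mark a missing value, and the 1.5 × IQR cutoff is a common rule of thumb rather than something the note specifies.

```python
import statistics
from collections import defaultdict


def impute_median(values):
    """Fill missing entries (None) with the median of the observed values."""
    median = statistics.median([v for v in values if v is not None])
    return [median if v is None else v for v in values]


def drop_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def aggregate_sum(records, key, field):
    """Shift granularity: sum `field` for each distinct value of `key`."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec[field]
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    print(impute_median([1, None, 3, 5]))
    print(drop_outliers_iqr([10, 12, 11, 13, 12, 100]))
    sales = [
        {"day": "mon", "sales": 2},
        {"day": "mon", "sales": 3},
        {"day": "tue", "sales": 5},
    ]
    print(aggregate_sum(sales, key="day", field="sales"))
```

Median imputation is only the simplest of the options listed under "Data Imputations"; hot-deck and model-based methods need more context about the dataset than this sketch assumes.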
Data labeling
- Assign each data point to a specific category
Data splitting (train, validation, test datasets)
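
A random three-way split can be sketched as below. The 70/15/15 proportions, the fixed seed, and the function name `split_dataset` are assumptions chosen for illustration; the note does not prescribe ratios, and time-ordered or imbalanced data would usually call for a different strategy than a plain shuffle.

```python
import random


def split_dataset(records, train=0.7, validation=0.15, seed=42):
    """Shuffle records and split them into train / validation / test lists.

    The test set receives whatever remains after train and validation.
    """
    if train <= 0 or validation < 0 or train + validation >= 1:
        raise ValueError("train and validation fractions must leave room for a test set")
    shuffled = list(records)                # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)   # seeded for a reproducible split
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


if __name__ == "__main__":
    train_set, val_set, test_set = split_dataset(range(100))
    print(len(train_set), len(val_set), len(test_set))
```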

lukkiddd. 2022, powered by Jekyll Garden
