Data Engineering
- Data Ingestion
    - Includes synthetic data generation or data enrichment
    - Details
        - Data Sources Identification
        - Space Estimation
        - Space Location
        - Obtaining Data
        - Backup data
        - Privacy Compliance
        - Metadata catalog
 
 
- Exploration and Validation
    - Data Profiling: Schema, data types, metadata (max, min, avg), user-defined error detection
    - Details
        - Use RAD tools (Jupyter notebooks) to keep records of data exploration and experimentation
        - Attribute profiling
            - Name
            - Number of records
            - Data type (categorical, numerical, int/float, text, structured, etc.)
            - Numerical Measures (min, max, avg, median)
            - Amount of missing values (or "missing value ratio" = number of absent values / number of records)
            - Type of distribution (Gaussian, uniform, logarithmic)
 
        - Label attribute identification
        - Data Visualization (Build a visual representation of the value distribution)
        - Attributes Correlation: Compute and analyze the correlations between attributes
        - Additional data: Identify data that would be useful for building the model (go back to "Data Ingestion")
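
The attribute profiling and correlation steps above can be sketched in plain Python. This is a minimal illustration, not a prescribed tool: the function names `profile_attribute` and `pearson`, the toy `ages` and `incomes` columns, and the convention that `None` marks an absent value are all assumptions made for the example.

```python
from statistics import mean, median


def profile_attribute(name, values):
    """Summarize one attribute; None marks an absent value (assumed convention)."""
    present = [v for v in values if v is not None]
    profile = {
        "name": name,
        "records": len(values),
        # missing value ratio = number of absent values / number of records
        "missing_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
    }
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    profile["type"] = "numerical" if numeric else "categorical/text"
    if numeric:
        profile.update(
            min=min(present), max=max(present),
            avg=mean(present), median=median(present),
        )
    return profile


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present.

    Raises ZeroDivisionError if either attribute is constant.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x, _ in pairs) ** 0.5
    sy = sum((y - my) ** 2 for _, y in pairs) ** 0.5
    return cov / (sx * sy)


# Hypothetical toy columns, for illustration only
ages = [25, 32, None, 41, 38]
incomes = [30_000, 42_000, 39_000, 58_000, 52_000]

age_profile = profile_attribute("age", ages)
age_income_corr = pearson(ages, incomes)
```

In a notebook, the same summary would usually come from a dataframe library; the hand-written version only makes explicit which measures the outline asks for.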
 
 
- Data Wrangling (Cleaning)
    - Reformatting attributes and correcting errors in data, such as imputing missing values
    - Details:
        - Data Imputations
            - What is imputation
            - How to impute values
                - Mean/Median/Mode
                - Hot-Deck
                - Model-based
                - Proper multiple Stochastic Regression
                - Pattern Submodel Approach
            - ...
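
As a rough illustration of the simplest approach listed above (mean/median/mode), the sketch below fills absent values with a single statistic of the observed ones. The function name `impute`, the `strategy` parameter, the `heights` column, and the use of `None` for missing values are assumptions for the example. The other listed methods are not covered here.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean"):
    """Replace None entries with one statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries
heights = [170, None, 165, 180, None, 165]

by_mean = impute(heights, "mean")      # [170, 170, 165, 180, 170, 165]
by_median = impute(heights, "median")  # [170, 167.5, 165, 180, 167.5, 165]
by_mode = impute(heights, "mode")      # [170, 165, 165, 180, 165, 165]
```

Single-value imputation keeps every record but shrinks the variance of the attribute, which is one reason the more elaborate methods in the list exist.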
        - Transformations
        - Outliers: Fix or remove outliers
        - Missing values: Fill in missing values (e.g., with zero, mean, median) or drop their rows or columns
        - Not relevant data: Drop attributes that provide no useful information for the task (relevant for feature engineering)
        - Restructure data: Might include the following operations
            - Reordering record fields by moving columns
            - Creating new record fields by extracting values
            - Combining multiple record fields into a single record field
            - Filtering datasets by removing sets of records
            - Shifting the granularity of the dataset and the fields associated with records through aggregations and pivots
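
The restructuring operations above can be sketched on a small list of records. The `orders` data, its field names, and the rule that a negative amount is invalid are hypothetical, chosen only to show filtering, combining fields, extracting a new field, and aggregating to a coarser granularity.

```python
from collections import defaultdict

# Hypothetical order records, one row per order
orders = [
    {"first": "Ada", "last": "Lovelace", "date": "2024-01-15", "amount": 120.0},
    {"first": "Ada", "last": "Lovelace", "date": "2024-01-20", "amount": 80.0},
    {"first": "Alan", "last": "Turing", "date": "2024-02-03", "amount": 200.0},
    {"first": "Alan", "last": "Turing", "date": "2024-02-10", "amount": -5.0},
]

# Filtering: remove records with an invalid (negative) amount
valid = [r for r in orders if r["amount"] >= 0]

# Combining two fields into one, and extracting a new field (month) from the date
reshaped = [
    {
        "customer": f"{r['first']} {r['last']}",
        "month": r["date"][:7],
        "amount": r["amount"],
    }
    for r in valid
]

# Aggregation: shift granularity from one row per order
# to one row per customer and month
totals = defaultdict(float)
for r in reshaped:
    totals[(r["customer"], r["month"])] += r["amount"]

monthly = [
    {"customer": c, "month": m, "total": t}
    for (c, m), t in sorted(totals.items())
]
```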
 
 
 
- Data labeling
    - Each data point is assigned to a specific category
 
- Data splitting (train, validation, test datasets)
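
A minimal sketch of a three-way split follows. The 70/15/15 proportions, the fixed seed, and the function name `split_dataset` are assumptions for illustration. The outline does not prescribe ratios, and a plain shuffle like this may not suit time-ordered or heavily imbalanced data.

```python
import random


def split_dataset(records, train=0.7, validation=0.15, seed=42):
    """Shuffle a copy of the records and cut it into train/validation/test lists."""
    if not 0 < train + validation < 1:
        raise ValueError("train + validation must leave room for a test set")
    shuffled = list(records)
    # A seeded generator makes the split reproducible across runs
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical dataset of 100 record ids
train_set, val_set, test_set = split_dataset(range(100))
```

Whatever remains after the train and validation cuts becomes the test set, so the three parts always cover the dataset without overlap.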