Usage

The SDQC library can load external model input data directly from Vensim’s .mdl files, PySD-translated .py files, and user-defined dictionaries that describe where each dataset is located in the data file. The check() function launches the quality check for all three input types.

By default, a pandas.DataFrame is returned with information on the data quality issues identified. If verbose=True is passed as an argument to the check() function, the results of all performed checks are returned, regardless of whether any issue was identified.

Warning

With the default parameters (see section Default configuration), check() only identifies missing values and checks the monotonicity of series data. To modify the default configuration, see Overriding defaults.

Note

For more information on the arguments accepted by the check() function, see the API section or use help() as follows:

import sdqc

help(sdqc.check)

Running the data quality checks

Using a Vensim model file

To run the default data quality checks directly from a Vensim model ‘.mdl’ file:

import sdqc

model = sdqc.check('my_model.mdl')

Using a PySD model file

To run the default data quality checks from a PySD model ‘.py’ file:

import sdqc

model = sdqc.check('my_model.py')

Note

Both of the previous cases rely on PySD. When a .mdl file is passed, PySD must first translate it, so using an already translated .py file is naturally faster.

Note

The PySD library is under constant development and does not yet support every function available in Vensim, so a model that uses a not yet implemented Vensim function may cause the PySD translator to stop unexpectedly. If this happens, please report it on the PySD issue tracker on GitHub.

Using either a Vensim or PySD model file and a netCDF file containing the external data

The only difference from the previous two cases is that the external data is loaded from a netCDF file (.nc), which is much faster:

import sdqc

model = sdqc.check('my_model.py', externals="externals.nc")

Note

For instructions on how to export the external objects into a netCDF file, please check the PySD documentation.

Using a list of initialized PySD External objects

To run the checks on a chosen set of the model’s external data elements, the user can load and initialize those elements beforehand (using the load_data() function) and then pass them to the check() function as follows:

import sdqc
from sdqc.loading import load_data

elements = ['Var1_data', 'Var2_data', 'Var3_data']

external_data = load_data('my_model.mdl', elements_subset=elements)

sdqc.check(external_data, config_file='checks_config.ini')

The path passed to the load_data() function can either point to a Vensim model file or a PySD model file. Similarly, the names of the variables passed as the elements_subset argument can use the original name of the variable in Vensim (original_name), or the names used in Python for the function that calls the external object (py_short_name), or the name of the external object itself (py_name). If the elements_subset is not passed, all the external data elements of the model will be loaded.

As an alternative to the elements_subset argument, the user can pass the files_subset argument, which is a list of the names of the files from which to load external data. If the files_subset argument is not passed, variables from all external files in the model will be loaded. See the example below:

import sdqc
from sdqc.loading import load_data

files = ['model_parameters/file1.xlsx', 'model_parameters/file2.xlsx']

external_data = load_data('my_model.mdl', files_subset=files)

sdqc.check(external_data, config_file='checks_config.ini')

Note

File paths provided in the files list must be relative to the location of the model file.
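Since the entries of the files list are interpreted relative to the model file, a small helper can derive them from paths expressed in any other way. The following is a minimal sketch that uses only the Python standard library. The function relative_to_model is hypothetical (it is not part of SDQC), and the use of forward slashes simply mirrors the example above rather than a documented requirement.

```python
import os
from pathlib import PurePath


def relative_to_model(model_path, data_path):
    """Return data_path expressed relative to the folder holding model_path."""
    model_dir = os.path.dirname(os.path.abspath(model_path))
    rel = os.path.relpath(os.path.abspath(data_path), start=model_dir)
    # use forward slashes, as in the files list of the example above
    return PurePath(rel).as_posix()


# hypothetical project layout, used only for illustration
files = [
    relative_to_model("project/my_model.mdl",
                      "project/model_parameters/file1.xlsx"),
    relative_to_model("project/my_model.mdl",
                      "project/model_parameters/file2.xlsx"),
]
print(files)
```

The resulting list could then be passed as the files_subset argument of load_data().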

Loading and initializing the external data objects beforehand makes it possible to run the check() function with any combination of check configurations, report configurations and report formats, while loading the required model external data (which is computationally expensive) only once. For large models, this makes it convenient to split the check configuration across several files. See the example below:

import sdqc
from sdqc.loading import load_data

# separate quality check configuration files for outliers, monotonicity and missing values
check_config_files = ['outliers.ini', 'monotonicity.ini', 'missing_values.ini']

# paths of the reports to be generated (note the different report formats)
report_files = ['outliers.html', 'monotonicity.md', 'missing_values.docx']

# configuration of the thresholds that define the severity of the issue
# (e.g. "WARNING", "ERROR", "CRITICAL") depending on the number of issues for
# each type of check
report_config_file = 'report_config.ini'

# load and initialize all model external data objects
external_data = load_data('my_model.py')

# run the data quality checks configured in each configuration file, and
# generate the reports defined in the report_files
for chck, rep in zip(check_config_files, report_files):
    sdqc.check(external_data, config_file=chck, output='report',
               report_config=report_config_file, report_filename=rep,
               verbose=True)

The previous code will load all model external data first, and then run the data quality checks configured in each configuration file, and generate the reports defined in the report_files (in html, markdown and docx, respectively).
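The three check configuration files used above could be written by hand or generated with Python’s configparser module. The sketch below takes the second route. The parameter names (missing, outliers, series_monotony) come from the default configuration file, whereas the helper write_check_configs and the decision to place every option under [All] are assumptions made for illustration.

```python
import configparser
import os
import tempfile

# One check is enabled per file. The other two are switched off explicitly,
# because parameters that are not set fall back to the default configuration.
CHECKS = {
    "outliers.ini": {"missing": "no", "outliers": "yes",
                     "series_monotony": "no"},
    "monotonicity.ini": {"missing": "no", "outliers": "no",
                         "series_monotony": "yes"},
    "missing_values.ini": {"missing": "yes", "outliers": "no",
                           "series_monotony": "no"},
}


def write_check_configs(folder):
    """Write one .ini file per entry of CHECKS and return the file paths."""
    paths = []
    for filename, options in CHECKS.items():
        parser = configparser.ConfigParser()
        parser["All"] = options
        path = os.path.join(folder, filename)
        with open(path, "w") as handle:
            parser.write(handle)
        paths.append(path)
    return paths


# write the files to a temporary folder and read one of them back
with tempfile.TemporaryDirectory() as folder:
    written = write_check_configs(folder)
    check = configparser.ConfigParser()
    check.read(written[0])
    outliers_enabled = check.getboolean("All", "outliers")

print(outliers_enabled)
```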

Using a user-defined dictionary

Using a dictionary to load the external data is also possible. The key-value pairs required to read LOOKUP, DATA and CONSTANT elements are shown hereafter:

{
  'element_lookup':{
    'type': 'EXTERNAL',
    'excel': [
        {'filename': file_name,
         'sheet': sheet_name,
         'cell': cell_value,
         'x_row_or_cols': x_row_or_cols,
         'subs': None,
         'root': root_to_file}
    ]
  },
  'element_data':{
    'type': 'EXTERNAL',
    'excel': [
        {'filename': file_name,
         'sheet': sheet_name,
         'cell': cell_value,
         'x_row_or_cols': time_row_or_cols,
         'subs': None,
         'root': root_to_file}
    ]
  },
  'element_constant':{
    'type': 'EXTERNAL',
    'excel': [
        {'filename': file_name,
         'sheet': sheet_name,
         'cell': cell_value,
         'x_row_or_cols': None,
         'subs': None,
         'root': root_to_file}
    ]
  }
}

Note

Though an unlimited number of elements may be loaded, the element name given as a key to the dictionary must be unique.
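Writing these entries by hand is error-prone, so they can also be assembled programmatically. The sketch below follows the key names shown above. The functions excel_source and add_element are hypothetical helpers (they are not part of SDQC), and the file, sheet and cell values are placeholders.

```python
import os


def excel_source(filename, sheet, cell, root, x_row_or_cols=None, subs=None):
    """Describe one block of data in an Excel file, using the keys shown above."""
    return {
        "filename": filename,
        "sheet": sheet,
        "cell": cell,
        "x_row_or_cols": x_row_or_cols,
        "subs": subs,
        "root": root,
    }


def add_element(elements, name, *sources):
    """Add an EXTERNAL element, refusing to overwrite an existing name."""
    if name in elements:
        raise KeyError(f"element name {name!r} is already defined")
    elements[name] = {"type": "EXTERNAL", "excel": list(sources)}
    return elements


root = os.getcwd()
elements = {}
# a constant has no x (or time) axis, so x_row_or_cols stays None
add_element(elements, "element_constant",
            excel_source("inputs.xlsx", "Sheet1", "B4", root))
# a data series also needs the row or column that holds the time values
add_element(elements, "element_data",
            excel_source("inputs.xlsx", "Sheet1", "B7", root,
                         x_row_or_cols="3"))
```

Raising an error on a repeated name makes the uniqueness requirement explicit, instead of silently replacing an earlier entry as a plain dictionary assignment would.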

For matrices defined in several lines, the subscript information must be provided in the dictionary as follows:

{
  'element': {
    'type': 'EXTERNAL',
    'excel': [
        {
        'filename': file_name,
        'sheet': sheet_name,
        'cell': cell_value,
        'x_row_or_cols': time_row_or_cols,
        'subs': ['A', 'Dim2'],
        'root': root_to_file
        },
        {
        'filename': file_name,
        'sheet': sheet_name,
        'cell': cell_value2,
        'x_row_or_cols': time_row_or_cols,
        'subs': ['B', 'Dim2'],
        'root': root_to_file
        }
      ]
    },
  'Dim1': {
    'type': 'SUBSCRIPT',
    'values': ['A', 'B']
    },
  'Dim2': {
    'type': 'SUBSCRIPT',
    'values': ['C', 'D', 'E']
    }
}

Note

Subscripts must be defined using the subscript range as the key and their values inside a list.
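A quick consistency check can catch subscript typos before the dictionary is passed to check(). The sketch below assumes, based on the example above, that every entry of subs is either the name of a subscript range defined in the dictionary or one of the values of such a range. The function undefined_subscripts is a hypothetical helper, not an SDQC function.

```python
def undefined_subscripts(elements):
    """Return (element, subscript) pairs whose subscript is not defined."""
    ranges = {name: spec["values"] for name, spec in elements.items()
              if spec.get("type") == "SUBSCRIPT"}
    # accept both range names (e.g. 'Dim2') and individual values (e.g. 'A')
    known = set(ranges)
    for values in ranges.values():
        known.update(values)

    problems = []
    for name, spec in elements.items():
        if spec.get("type") != "EXTERNAL":
            continue
        for source in spec["excel"]:
            for sub in source.get("subs") or []:
                if sub not in known:
                    problems.append((name, sub))
    return problems


# reduced dictionary: only the keys needed by the check are included
example = {
    "element": {
        "type": "EXTERNAL",
        "excel": [{"subs": ["A", "Dim2"]}, {"subs": ["B", "Dim3"]}],
    },
    "Dim1": {"type": "SUBSCRIPT", "values": ["A", "B"]},
    "Dim2": {"type": "SUBSCRIPT", "values": ["C", "D", "E"]},
}
problems = undefined_subscripts(example)
print(problems)
```

Here 'Dim3' is reported because no subscript range with that name is defined in the dictionary.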

The example below demonstrates how to load data located in Sheet1 of the inputs.xlsx file (located in the current working directory), based on the information available in the dictionary:

import os
import sdqc

_root = os.getcwd()  # full path to current working directory

element_dict = {
  'element': {
    'type': 'EXTERNAL',
    'excel': [
        {
        'filename': 'inputs.xlsx',
        'sheet': 'Sheet1',
        'cell': 'B4',
        'x_row_or_cols': '3',
        'subs': ['A', 'Dim2'],
        'root': _root
        },
        {
        'filename': 'inputs.xlsx',
        'sheet': 'Sheet1',
        'cell': 'B7',
        'x_row_or_cols': '3',
        'subs': ['B', 'Dim2'],
        'root': _root
        }
      ]
    },
  'Dim1': {
    'type': 'SUBSCRIPT',
    'values': ['A', 'B']
    },
  'Dim2': {
    'type': 'SUBSCRIPT',
    'values': ['C', 'D', 'E']
    }
}

model = sdqc.check(element_dict)

Note

This example shows how to transform a JSON file with an arbitrary structure into a dictionary that follows the structure required by the SDQC library. Several test JSON files may be found here.
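As a rough illustration of such a transformation, the sketch below converts a small JSON document into the dictionary structure described in this section. The JSON layout (the inputs, name, file, tab, first_cell and time keys) is invented for this example, and to_sdqc_dict is a hypothetical helper rather than an SDQC function.

```python
import json
import os

# hypothetical JSON layout, used only to illustrate the conversion
raw = json.loads("""
{
  "inputs": [
    {"name": "element_constant", "file": "inputs.xlsx",
     "tab": "Sheet1", "first_cell": "B4"},
    {"name": "element_data", "file": "inputs.xlsx",
     "tab": "Sheet1", "first_cell": "B7", "time": "3"}
  ]
}
""")


def to_sdqc_dict(raw, root):
    """Map each JSON item to an EXTERNAL element with a single Excel source."""
    elements = {}
    for item in raw["inputs"]:
        elements[item["name"]] = {
            "type": "EXTERNAL",
            "excel": [{
                "filename": item["file"],
                "sheet": item["tab"],
                "cell": item["first_cell"],
                # constants have no x/time axis, so a missing key maps to None
                "x_row_or_cols": item.get("time"),
                "subs": None,
                "root": root,
            }],
        }
    return elements


element_dict = to_sdqc_dict(raw, root=os.getcwd())
print(sorted(element_dict))
```

The resulting element_dict has the same shape as the dictionaries shown above and could be passed to sdqc.check() in the same way.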

Configuring the data quality checks

Structure of the configuration files

The configuration files used to parametrise the quality checks in SDQC follow the hierarchy shown in the figure below:

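[Figure: hierarchy of the configuration sections, from All, through Constants/Dataseries, down to individual elements (_images/hierarchy.png)]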

The types of checks to perform (see section Currently implemented data quality checks for a list of the available checks) may be defined at any of the 3 levels, under the sections with the same names (All, Constants/Dataseries, individual element name) in the configuration file.

Note

Specific elements can be defined using Vensim’s variable names (e.g. Constant Var 1), PySD element names (e.g. constant_var_1) or PySD object names (e.g. _ext_constant_constant_var_1).
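To make the relation between the three naming styles concrete, the sketch below derives the two Python-style names from the Vensim name used in the note above. It is a deliberate simplification that only reproduces that single example. It does not attempt to follow PySD’s actual translation rules, and both the function name_variants and the generalization of the _ext_constant_ prefix to other kinds of elements are assumptions.

```python
def name_variants(vensim_name, kind="constant"):
    """Return the three naming styles for a simple, space-separated name."""
    # lower-case the name and join its words with underscores
    py_short_name = "_".join(vensim_name.lower().split())
    return {
        "original_name": vensim_name,
        "py_short_name": py_short_name,
        # prefix taken from the example in the note above (assumed pattern)
        "py_name": f"_ext_{kind}_{py_short_name}",
    }


variants = name_variants("Constant Var 1")
print(variants)
```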

Configurations defined for a child section override those of its parent, but only for the elements that the child section covers. See the Overriding defaults and Examples sections for more details on the hierarchy.

Default configuration

The default configuration file (default-conf.ini), located in the project’s main folder, contains the following parameters:

[All]
# MISSING VALUES
# check for missing values
missing = yes
# completeness of missing values (see documentation of missing_values_data)
completeness = any

# OUTLIERS
# check for possible outliers
outliers = no
# method to use (see documentation of outlier_values)
outliers_method = std
# number of standard deviations to detect outliers (std method)
outliers_nstd = 2
# number of interquartile ranges to detect outliers (iqr method)
outliers_niqr = 1.5

# MONOTONY
# check series monotony
series_monotony = yes

# RANGE
# series range
series_range = false
# min, max values
series_range_values = -1e350 1e350

# INCREMENT
# series increment type
series_increment = no
series_increment_type = linear

Note

The only checks activated in the default configuration are the detection of missing values and the monotonicity for ALL elements (note that all configurations are defined under the [All] section).
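The default values can also be inspected programmatically. The sketch below parses the options listed above with Python’s configparser module and reports which checks are switched on. It works independently of how SDQC itself reads the file, and the choice of which options to treat as on/off switches (the SWITCHES tuple) is an assumption based on their values.

```python
import configparser

# options copied from the default configuration shown above
DEFAULT_CONF = """
[All]
missing = yes
completeness = any
outliers = no
outliers_method = std
outliers_nstd = 2
outliers_niqr = 1.5
series_monotony = yes
series_range = false
series_range_values = -1e350 1e350
series_increment = no
series_increment_type = linear
"""

# options that act as on/off switches (assumed from their yes/no values)
SWITCHES = ("missing", "outliers", "series_monotony",
            "series_range", "series_increment")

parser = configparser.ConfigParser()
parser.read_string(DEFAULT_CONF)

# getboolean() understands yes/no as well as true/false
enabled = [name for name in SWITCHES if parser.getboolean("All", name)]
print(enabled)

# as Python floats, the default range limits overflow to -inf and inf
low, high = (float(v) for v in parser["All"]["series_range_values"].split())
print(low, high)
```

When read as Python floats, the default range limits are effectively unbounded, so the range would only become restrictive once narrower values are set.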

Note

To see the list of data quality checks that you can add to the configuration file, see the Currently implemented data quality checks section.

Overriding defaults

To override any (or all) of the default settings in default-conf.ini, users are encouraged to set the desired values in a new .ini file (following the structure described in the Structure of the configuration files section).

Note

If a configuration parameter is not set for one or more elements, the value of that parameter in the default configuration file is used.

For the new configurations to take effect, the path to the new .ini file must be passed as an argument to the check() function:

import sdqc

model = sdqc.check('my_model.mdl', 'my_config.ini')

Examples

Note

This section assumes that the user has created a new .ini configuration file, whose path will be passed to the check() function to override the default configurations from the default-conf.ini file.

To assign specific data quality checks to a certain individual element, users may add the following section:

[Constant Var 1]
missing = true
outliers = false

In the example below, Constant Var 1 does not declare outliers_method, so it inherits the value of this parameter from the Constants section (its parent). Also, even though the parent section (Constants) says otherwise, Constant Var 1 will not be checked for missing values (missing = false). Similarly, the outlier detection method (outliers_method) for all Constants will be iqr (interquartile range), even though the parent section (All) defines it as std (standard deviation):

[All]
outliers_method = std

[Dataseries]
outliers = true

[Constants]
outliers_method = iqr
missing = true

[Constant Var 1]
missing = false
outliers = true
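The way these sections combine can be reproduced with a few lines of Python. The sketch below merges the sections of the example above from the most general to the most specific one, so that later values override earlier ones. It only illustrates the precedence rule and is not SDQC’s implementation: the function effective_config is hypothetical, and the values from default-conf.ini, which would also apply in practice, are left out.

```python
import configparser

# configuration copied from the example above
USER_CONF = """
[All]
outliers_method = std

[Dataseries]
outliers = true

[Constants]
outliers_method = iqr
missing = true

[Constant Var 1]
missing = false
outliers = true
"""


def effective_config(parser, element, group):
    """Merge sections from the most general to the most specific one."""
    merged = {}
    for section in ("All", group, element):
        if parser.has_section(section):
            merged.update(parser[section].items())
    return merged


parser = configparser.ConfigParser()
parser.read_string(USER_CONF)

resolved = effective_config(parser, "Constant Var 1", "Constants")
print(resolved)
```

For Constant Var 1, the merged result takes outliers_method = iqr from Constants, while missing = false and outliers = true come from the element’s own section.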