Report Generation

SDQC reports classify the identified issues by severity (WARNING, ERROR and CRITICAL), provide statistics on their number and type, and can be exported to a file.

The currently supported report file formats are html (default), markdown (github), docx and pdf.

Warning

Generating markdown and pdf reports requires the Pandoc package. Building pdf reports additionally requires the TeX Live and XeTeX packages.

Note

SDQC has been tested with Pandoc >= 2.17. Pandoc versions below 2.14 do not support the rowspan attribute in html <th> tags and will not produce adequate results for report formats other than html. Installing pypandoc with conda also installs the latest version of Pandoc. Installing pypandoc with pip does not install Pandoc automatically, but its latest version can be installed using pypandoc.pandoc_download.download_pandoc(). For more details, please check the pypandoc documentation.
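The version requirement in the note above can be checked before generating non-html reports. The following is a minimal sketch using only the Python standard library. The helper names parse_pandoc_version and installed_pandoc_version are hypothetical and are not part of SDQC or pypandoc, and the sketch assumes that the pandoc executable is reachable through PATH.

```python
import re
import shutil
import subprocess

# Minimum version quoted in the note above (rowspan support in <th> tags).
MIN_PANDOC = (2, 14)


def parse_pandoc_version(version_output):
    """Return the first version number found in `pandoc --version` output as a tuple."""
    match = re.search(r"\d+(?:\.\d+)*", version_output)
    if match is None:
        raise ValueError("no version number found in pandoc output")
    return tuple(int(part) for part in match.group(0).split("."))


def installed_pandoc_version():
    """Return the version of the pandoc executable found on PATH, or None."""
    executable = shutil.which("pandoc")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return parse_pandoc_version(result.stdout)


if __name__ == "__main__":
    version = installed_pandoc_version()
    if version is None:
        print("pandoc was not found on PATH")
    elif version < MIN_PANDOC:
        print("pandoc is older than 2.14; non-html reports may not render adequately")
    else:
        print("pandoc version:", ".".join(str(part) for part in version))
```

This only inspects the executable on PATH. A copy of Pandoc installed elsewhere (for example, one downloaded through pypandoc to a custom folder) would not be detected by this sketch.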

To generate reports with default configurations and format, use the following command:

import sdqc
sdqc.check('my_model.mdl', output='report', verbose=True)

If the argument verbose is set to True, a summary with more detailed data quality statistics will be added to the report.

Overriding default report configuration

The default configuration for the reports is given in the default-report-conf.json file.

To override the defaults, copy and rename the default JSON file and set new values in the copy. The path to the new configuration file can then be passed as an argument as follows:

sdqc.check('my_model.mdl', output='report', report_config='my-report-conf.json', verbose=True)

Configuration files must follow the same structure as the default one, and need only include the fields that the user wants to override. The available configuration fields and their values in the default configuration file are:

  • missing_values_series: {"WARNING": 20, "ERROR": 30, "CRITICAL": 50},

  • outlier_values: {"WARNING": 20, "ERROR": 30, "CRITICAL": 50},

  • missing_values: {"WARNING": 20, "ERROR": 30, "CRITICAL": 50},

  • missing_values_data: {"WARNING": 20, "ERROR": 30, "CRITICAL": 50},

  • series_monotony: "CRITICAL",

  • series_range: "ERROR",

  • series_increment_type: "ERROR"
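As an illustration of a configuration file that overrides only some of the fields listed above, the following sketch writes one with the Python standard library. The chosen values are arbitrary examples, not recommendations, and the file name simply reuses the one from the command shown earlier.

```python
import json
from pathlib import Path

# Only the fields to override are listed; all values here are illustrative.
overrides = {
    "missing_values": {"WARNING": 10, "ERROR": 25, "CRITICAL": 40},
    "series_range": "CRITICAL",
}

config_path = Path("my-report-conf.json")
config_path.write_text(json.dumps(overrides, indent=4), encoding="utf-8")

# The resulting file can then be passed to SDQC as shown earlier:
# sdqc.check('my_model.mdl', output='report', report_config='my-report-conf.json', verbose=True)
```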

Note

missing_values_series, outlier_values, missing_values and missing_values_data only accept objects with three integer values between 0 and 100, corresponding to the tolerance associated with WARNING, ERROR and CRITICAL. For instance, if missing_values is set to {'WARNING': 20, 'ERROR': 30, 'CRITICAL': 50}, an issue will be graded as a WARNING if the data contain less than 20% of missing values, while it will be graded as ERROR or CRITICAL if the data contain more than 30% or 50% of missing values, respectively.

Note

series_monotony, series_range and series_increment_type only accept strings corresponding to the severity given to the issues of each of these types.
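The constraints in the two notes above can be checked before passing a custom configuration to SDQC. The following is a minimal sketch of such a check. The function find_config_problems is hypothetical and is not part of SDQC, whose own validation may behave differently. It encodes only the rules stated above.

```python
SEVERITIES = ("WARNING", "ERROR", "CRITICAL")
THRESHOLD_FIELDS = {
    "missing_values_series",
    "outlier_values",
    "missing_values",
    "missing_values_data",
}
SEVERITY_FIELDS = {"series_monotony", "series_range", "series_increment_type"}


def _is_percentage(value):
    # bool is a subclass of int in Python, so it is excluded explicitly
    return isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 100


def find_config_problems(config):
    """Return a list of problem descriptions; an empty list means none were found."""
    problems = []
    for field, value in config.items():
        if field in THRESHOLD_FIELDS:
            if not isinstance(value, dict) or set(value) != set(SEVERITIES):
                problems.append(f"{field}: expected an object with keys {SEVERITIES}")
            elif not all(_is_percentage(v) for v in value.values()):
                problems.append(f"{field}: values must be integers between 0 and 100")
        elif field in SEVERITY_FIELDS:
            if value not in SEVERITIES:
                problems.append(f"{field}: expected one of {SEVERITIES}")
        else:
            problems.append(f"{field}: unknown configuration field")
    return problems
```

For example, find_config_problems({"series_monotony": "FATAL"}) reports one problem, because "FATAL" is not one of the three severities.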

Generating multiple reports at once

The build_report and report_to_file methods of the Reports class accept alternative report configurations and output files (and formats) for the same quality check results. This way, users can run the quality checks just once and generate any number of reports in any of the supported formats.

To do so, first create a Reports object by passing the argument output='report' to the check function, as described in the first section. Then call build_report and report_to_file once for each report configuration (JSON file or dictionary) and output file required:

# build a report using the report_conf.json report configuration file and save it to the file report.html
report_obj = sdqc.check('my_model.mdl', output='report', report_config='report_conf.json', report_filename='report.html', verbose=True)

# updating report configuration for missing values:
updated_report_config = {"missing_values_series": {"WARNING": 20, "ERROR": 40, "CRITICAL": 50}}

# update report using the new configuration (updated_report_config dictionary)
report_obj.build_report(updated_report_config)

# save the new report to a markdown file (report.md)
report_obj.report_to_file(report_filename='report.md')

# save the new report to a docx file (report.docx)
report_obj.report_to_file(report_filename='report.docx')

The previous code generates a report in html, and then two additional reports in markdown and docx, both using a different report configuration.
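When several configurations and output files are involved, the repeated calls can be wrapped in a small loop. The following is a minimal sketch. The helper export_reports is hypothetical and is not part of SDQC. It relies only on the build_report and report_to_file calls used in the example above, and assumes that report_obj is the object returned by sdqc.check with output='report'.

```python
def export_reports(report_obj, jobs):
    """Rebuild the report for each configuration and save it to each listed file.

    jobs is a list of (report_config, filenames) pairs, where report_config is
    passed to build_report and each name in filenames to report_to_file.
    """
    written = []
    for report_config, filenames in jobs:
        report_obj.build_report(report_config)
        for filename in filenames:
            report_obj.report_to_file(report_filename=filename)
            written.append(filename)
    return written


# Possible usage, mirroring the example above:
# report_obj = sdqc.check('my_model.mdl', output='report', verbose=True)
# updated_report_config = {"missing_values_series": {"WARNING": 20, "ERROR": 40, "CRITICAL": 50}}
# export_reports(report_obj, [(updated_report_config, ['report.md', 'report.docx'])])
```

The quality checks still run only once, in the call to sdqc.check. The loop only rebuilds and saves the reports.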