We Built an Open-Source Data Quality Test Framework for PySpark | by Tomer Gabay | Aug, 2024

Measure and report your data quality with ease

[image by author, generated with Dall-E]

Every data scientist knows the classic saying "garbage in, garbage out". It is therefore essential to measure the quality of your data.

At Woonstad Rotterdam, a Dutch social housing association, we use PySpark in Databricks for our ETL. Data from our external software vendors is loaded into our data lake via APIs. However, not every software vendor tests its data quality. The consequences of faulty data in the social housing sector can be significant, ranging from tenants being unable to apply for allowances to rents being set at prices that are illegal under the Affordable Rent Act. We therefore built a data quality test framework for PySpark DataFrames, so that we can report on data quality to both the suppliers and the users of the data.
