Define Data Schemas
Let's start using the TDD methodology in a more concrete way, to see how data schemas can be defined using pydantic.
Data Objects
Data structures are the fundamental constructs around which we build our programs. Each data structure provides a particular way of organizing data so it can be accessed efficiently, depending on the use case. Python ships with an extensive set of data structures in its standard library: lists, tuples, dictionaries, dataclasses, etc.
Classes allow you to define reusable blueprints for data objects to ensure each object provides the same set of fields.
Using regular Python classes as record data types is feasible, but it also takes manual work to get the convenience
features of other implementations. For example, adding new fields to the __init__
constructor is verbose and takes time.
Writing a custom class is a great option whenever you’d like to add business logic and behavior to your record objects using methods. However, this means that these objects are technically no longer plain data objects.
Data classes are available in Python 3.7 and above. They provide an excellent alternative to defining your own data storage classes from scratch.
And pydantic is built on top of dataclasses (TB confirmed).
Pydantic
From Pydantic website:
Data validation and settings management using python type annotations.
pydantic enforces type hints at runtime, and provides user friendly errors when data is invalid.
Define how data should be in pure, canonical python; validate it with pydantic.
Example
We want a kind of object, capable to store structured information, and independent of the database engine.
Let's define some tests to create our first Schema.
First of all, let's create a new test file:
touch tests/test_01_data_schemas.py
And let's define what do I want to do: - I want to have a data schema to define books sold in an online store. - A book has a title, an author, a price, and a publish year and (this is optional).
Define the assumptions I want to check. This code won't work, because the Book data schema is not defined. But writing the tests helps us to think about what fields are needed.
tests/test_01_data_schemas.py
+import pytest
+from pydantic import ValidationError
+
+
+def test_create_book_with_valid_schema_successfully() -> None:
+ book = Book(title="This is a nice book", author="Jane Appleseed", price=18.99, year=1856)
+ assert book.title == "This is a nice book"
+ assert book.author == "Jane Appleseed"
+ assert book.year == 1856
+ assert book.price == "Jane Appleseed"
+
+
+def test_create_book_with_invalid_schema_raises_exception() -> None:
+ with pytest.raises(ValidationError):
+ Book(author="Lewis Carroll", price=18.99),
At this point, I can define a data class for the books.
tests/test_01_data_schemas.py
@@ -1,12 +1,21 @@
+from typing import Optional
+
import pytest
-from pydantic import ValidationError
+from pydantic import BaseModel, constr, confloat, ValidationError
+
+
+class Book(BaseModel):
+ title: str
+ author: constr(min_length=1, regex="[A-Za-z ,.'-]+$")
+ price: confloat(gt=0)
+ year: Optional[int]
def test_create_book_with_valid_schema_successfully() -> None:
book = Book(title="This is a nice book", author="Jane Appleseed", price=18.99, year=1856)
assert book.title == "This is a nice book"
assert book.author == "Jane Appleseed"
assert book.year == 1856
assert book.price == "Jane Appleseed"
Some things worth to mention here about the field data types:
author
has been defined asconstr
instead ofstr
. Pydantic defins numerous commont ypes as "Constrained Types" and allow to add validations at the definition level. In this case I defined a minimum length and that the input value must match the regular expression.- The same happens with
price
, defined asconfloat
. In this case, the float value must be positive. - And finally, the
year
, which is defined as an optional integer.
More information about the validation types can be found in the pydantic page about "Field Types"
Improving Tests: Parametrize
In the next step, I'll use a new concept very useful to write tests using pytest:
parametrization allows to define a set of arguments and fixtures to be run for the same test function, facilitating the test repetition using multiple and different values. Let's see in our code:
tests/test_01_data_schemas.py
@@ -4,21 +4,44 @@
from pydantic import BaseModel, constr, confloat, ValidationError
class Book(BaseModel):
title: str
author: constr(min_length=1, regex="[A-Za-z ,.'-]+$")
price: confloat(gt=0)
year: Optional[int]
-def test_create_book_with_valid_schema_successfully() -> None:
- book = Book(title="This is a nice book", author="Jane Appleseed", price=18.99, year=1856)
- assert book.title == "This is a nice book"
- assert book.author == "Jane Appleseed"
- assert book.year == 1856
- assert book.price == "Jane Appleseed"
+@pytest.mark.parametrize(
+ "valid_schema", (
+ dict(title="This is a nice book", author="Jane Appleseed", price=18.99, year=1856),
+ dict(title="This is another nice book", author="John Smith", price=18.99)
+ ),
+)
+def test_create_book_with_valid_schema_successfully(valid_schema: dict) -> None:
+ book = Book(
+ title=valid_schema.get("title"),
+ author=valid_schema.get("author"),
+ year=valid_schema.get("year"),
+ price=valid_schema.get("price"),
+ )
+ assert book.title == valid_schema.get("title")
+ assert book.author == valid_schema.get("author")
+ assert book.year == valid_schema.get("year")
+ assert book.price == valid_schema.get("price")
-def test_create_book_with_invalid_schema_raises_exception() -> None:
+@pytest.mark.parametrize(
+ "invalid_schema", (
+ dict(author="Lewis Carroll", price=18.99),
+ dict(title="Another nice book", author="Lewis Carroll"),
+ dict(title="Lorem ipsum", author="Lewis Carroll", price=-10.0)
+ ),
+)
+def test_create_book_with_invalid_schema_raises_exception(invalid_schema: dict) -> None:
with pytest.raises(ValidationError):
- Book(author="Lewis Carroll", price=18.99),
+ Book(
+ title=invalid_schema.get("title"),
+ author=invalid_schema.get("author"),
+ year=invalid_schema.get("year"),
+ price=invalid_schema.get("price"),
+ )
What I did here is to add a code like this:
@pytest.mark.parametrize(
"valid_schema", (
dict(title="This is a nice book", author="Jane Appleseed", price=18.99, year=1856),
dict(title="This is another nice book", author="John Smith", price=18.99)
),
)
at the beginning of the test function, and redefined the function signature to:
def test_create_book_with_valid_schema_successfully(valid_schema: dict) -> None:
adding the valid_schema
parameter, that will get all the parametrized values defined previously.
With those changes, adding a few lines, I can run the same tests with many values as I need. In this case, running the tests with 2 test functions:
pytest tests/test_01_data_schemas.py -v
a total of 5 tests are executed:
======================================= test session starts ========================================
collected 5 items
tests/test_01_data_schemas.py::test_create_book_with_valid_schema_successfully[valid_schema0] PASSED [ 20%]
tests/test_01_data_schemas.py::test_create_book_with_valid_schema_successfully[valid_schema1] PASSED [ 40%]
tests/test_01_data_schemas.py::test_create_book_with_invalid_schema_raises_exception[invalid_schema0] PASSED [ 60%]
tests/test_01_data_schemas.py::test_create_book_with_invalid_schema_raises_exception[invalid_schema1] PASSED [ 80%]
tests/test_01_data_schemas.py::test_create_book_with_invalid_schema_raises_exception[invalid_schema2] PASSED [100%]
======================================== 5 passed in 0.07s =========================================
All green!