Define Data Schemas

Let's start using the TDD methodology in a more concrete way, to see how data schemas can be defined using pydantic.

Data Objects

Data structures are the fundamental constructs around which we build our programs. Each data structure provides a particular way of organizing data so it can be accessed efficiently, depending on the use case. Python ships with an extensive set of data structures in its standard library: lists, tuples, dictionaries, dataclasses, etc.

Classes allow you to define reusable blueprints for data objects to ensure each object provides the same set of fields.

Using regular Python classes as record data types is feasible, but it also takes manual work to get the convenience features of other implementations. For example, adding new fields to the __init__ constructor is verbose and takes time.

Writing a custom class is a great option whenever you’d like to add business logic and behavior to your record objects using methods. However, this means that these objects are technically no longer plain data objects.

Data classes are available in Python 3.7 and above. They provide an excellent alternative to defining your own data storage classes from scratch.
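
To make the contrast concrete, here is a minimal sketch comparing a hand-written record class with the equivalent data class. The names BookRecord and BookData are hypothetical and used only for illustration. Note that a standard data class does not check its type hints at runtime.

```python
from dataclasses import dataclass
from typing import Optional


class BookRecord:
    """Hand-written record: every field is repeated three times in __init__."""

    def __init__(self, title, author, price, year=None):
        self.title = title
        self.author = author
        self.price = price
        self.year = year


@dataclass
class BookData:
    """Same fields as a data class: __init__, __repr__ and __eq__ are generated."""

    title: str
    author: str
    price: float
    year: Optional[int] = None


book = BookData(title="A nice book", author="Jane Appleseed", price=18.99)

# The generated __eq__ compares instances field by field.
same = book == BookData(title="A nice book", author="Jane Appleseed", price=18.99)

# Type hints are not enforced: a string price is stored without complaint.
unchecked = BookData(title="Oops", author="Nobody", price="free")
```

The last line is the gap that a validation library such as pydantic is meant to close.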

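pydantic follows the same declarative, annotation-based style. Its models subclass pydantic's own BaseModel rather than building on the dataclasses module, although pydantic also provides a pydantic.dataclasses decorator that adds validation to standard data classes.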

Pydantic

From the pydantic website:

Data validation and settings management using python type annotations.

pydantic enforces type hints at runtime, and provides user friendly errors when data is invalid.

Define how data should be in pure, canonical python; validate it with pydantic.
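
A minimal sketch of what enforcing type hints at runtime looks like in practice. The Item model and its fields are hypothetical, chosen only for illustration. The sketch relies on pydantic's default behaviour of converting compatible input (the string "3") and rejecting incompatible input (the string "three").

```python
from pydantic import BaseModel, ValidationError


class Item(BaseModel):
    """Hypothetical model used only to illustrate runtime validation."""

    name: str
    quantity: int


# A numeric string is converted to the annotated type.
item = Item(name="pencil", quantity="3")

# A value that cannot be converted raises ValidationError.
try:
    Item(name="pencil", quantity="three")
except ValidationError as exc:
    # Each error entry records the location of the offending field.
    error_locations = [error["loc"] for error in exc.errors()]
```

Inspecting exc.errors() is a convenient way to see which field failed and why.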

Example

We want a kind of object that can store structured information and is independent of the database engine.

Let's define some tests to create our first Schema.

First of all, let's create a new test file:

touch tests/test_01_data_schemas.py 

And let's define what I want to do:

  • I want to have a data schema to define books sold in an online store.
  • A book has a title, an author, a price, and a publish year (the year is optional).

Define the assumptions I want to check. This code won't work, because the Book data schema is not defined. But writing the tests helps us to think about what fields are needed.

tests/test_01_data_schemas.py

+import pytest
+from pydantic import ValidationError
+
+
+def test_create_book_with_valid_schema_successfully() -> None:
+    book = Book(title="This is a nice book", author="Jane Appleseed", price=18.99, year=1856)
+    assert book.title == "This is a nice book"
+    assert book.author == "Jane Appleseed"
+    assert book.year == 1856
+    assert book.price == 18.99
+
+
+def test_create_book_with_invalid_schema_raises_exception() -> None:
+    with pytest.raises(ValidationError):
+        Book(author="Lewis Carroll", price=18.99)

At this point, I can define the data schema for the books as a pydantic model.

tests/test_01_data_schemas.py

@@ -1,12 +1,21 @@

+from typing import Optional
+
 import pytest
-from pydantic import ValidationError
+from pydantic import BaseModel, constr, confloat, ValidationError
+
+
+class Book(BaseModel):
+    title: str
+    author: constr(min_length=1, regex="[A-Za-z ,.'-]+$")
+    price: confloat(gt=0)
+    year: Optional[int]


 def test_create_book_with_valid_schema_successfully() -> None:
     book = Book(title="This is a nice book", author="Jane Appleseed", price=18.99, year=1856)
     assert book.title == "This is a nice book"
     assert book.author == "Jane Appleseed"
     assert book.year == 1856
     assert book.price == 18.99

Some things worth mentioning here about the field data types:

  • author has been defined as constr instead of str. Pydantic defines numerous common types as "Constrained Types", which allow validations to be added at the definition level. In this case I defined a minimum length and required the input value to match the regular expression. (The code here uses the pydantic 1.x API; in pydantic 2 the regex argument was renamed to pattern.)
  • The same happens with price, defined as confloat. In this case, the float value must be greater than zero.
  • And finally, year is defined as an optional integer. With the pydantic 1.x API used here, an Optional[int] field without a default may be omitted and is then set to None.

More information about the validation types can be found in the pydantic page about "Field Types".
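
To get a feel for what the author constraint accepts, the pattern can be tried directly with Python's re module. This is only a sketch: it assumes the pattern is applied from the start of the string, as re.match does, and author_matches is a hypothetical helper, not part of pydantic.

```python
import re

# Same pattern as in the Book schema: letters, spaces, commas, dots,
# apostrophes and hyphens, up to the end of the string.
AUTHOR_PATTERN = re.compile(r"[A-Za-z ,.'-]+$")


def author_matches(name: str) -> bool:
    """Hypothetical helper mirroring min_length=1 plus the regex check."""
    return len(name) >= 1 and AUTHOR_PATTERN.match(name) is not None


plain_name = author_matches("Jane Appleseed")     # letters and a space
punctuated_name = author_matches("O'Neil, J.-P.")  # allowed punctuation
with_digits = author_matches("R2-D2")              # digits are not allowed
empty_name = author_matches("")                    # fails the minimum length
```

The first two names are accepted and the last two are rejected, which matches what the constraints on author are meant to express.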

Improving Tests: Parametrize

In the next step, I'll use a pytest feature that is very useful when writing tests: parametrization.

Parametrization lets you define a set of arguments (and fixtures) for a single test function, so the same test runs once for each set of values. Let's see it in our code:

tests/test_01_data_schemas.py

@@ -4,21 +4,44 @@

 from pydantic import BaseModel, constr, confloat, ValidationError


 class Book(BaseModel):
     title: str
     author: constr(min_length=1, regex="[A-Za-z ,.'-]+$")
     price: confloat(gt=0)
     year: Optional[int]


-def test_create_book_with_valid_schema_successfully() -> None:
-    book = Book(title="This is a nice book", author="Jane Appleseed", price=18.99, year=1856)
-    assert book.title == "This is a nice book"
-    assert book.author == "Jane Appleseed"
-    assert book.year == 1856
-    assert book.price == 18.99
+@pytest.mark.parametrize(
+    "valid_schema", (
+        dict(title="This is a nice book", author="Jane Appleseed", price=18.99, year=1856),
+        dict(title="This is another nice book", author="John Smith", price=18.99)
+    ),
+)
+def test_create_book_with_valid_schema_successfully(valid_schema: dict) -> None:
+    book = Book(
+        title=valid_schema.get("title"),
+        author=valid_schema.get("author"),
+        year=valid_schema.get("year"),
+        price=valid_schema.get("price"),
+    )
+    assert book.title == valid_schema.get("title")
+    assert book.author == valid_schema.get("author")
+    assert book.year == valid_schema.get("year")
+    assert book.price == valid_schema.get("price")


-def test_create_book_with_invalid_schema_raises_exception() -> None:
+@pytest.mark.parametrize(
+    "invalid_schema", (
+        dict(author="Lewis Carroll", price=18.99),
+        dict(title="Another nice book", author="Lewis Carroll"),
+        dict(title="Lorem ipsum", author="Lewis Carroll", price=-10.0)
+    ),
+)
+def test_create_book_with_invalid_schema_raises_exception(invalid_schema: dict) -> None:
     with pytest.raises(ValidationError):
-        Book(author="Lewis Carroll", price=18.99)
+        Book(
+            title=invalid_schema.get("title"),
+            author=invalid_schema.get("author"),
+            year=invalid_schema.get("year"),
+            price=invalid_schema.get("price"),
+        )

What I did here was add a decorator like this:

@pytest.mark.parametrize(
    "valid_schema", (
        dict(title="This is a nice book", author="Jane Appleseed", price=18.99, year=1856),
        dict(title="This is another nice book", author="John Smith", price=18.99)
    ),
)

right above the test function, and redefine the function signature to:

def test_create_book_with_valid_schema_successfully(valid_schema: dict) -> None:

adding the valid_schema parameter, which receives each of the parametrized values in turn, one per test run.

With those few extra lines, I can run the same test with as many values as I need. In this case, running the file with its 2 test functions:

pytest tests/test_01_data_schemas.py -v

a total of 5 tests are executed:

======================================= test session starts ========================================
collected 5 items                                                                                  

tests/test_01_data_schemas.py::test_create_book_with_valid_schema_successfully[valid_schema0] PASSED [ 20%]
tests/test_01_data_schemas.py::test_create_book_with_valid_schema_successfully[valid_schema1] PASSED [ 40%]
tests/test_01_data_schemas.py::test_create_book_with_invalid_schema_raises_exception[invalid_schema0] PASSED [ 60%]
tests/test_01_data_schemas.py::test_create_book_with_invalid_schema_raises_exception[invalid_schema1] PASSED [ 80%]
tests/test_01_data_schemas.py::test_create_book_with_invalid_schema_raises_exception[invalid_schema2] PASSED [100%]

======================================== 5 passed in 0.07s =========================================

All green!
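
As a further illustration, parametrize also accepts several argument names at once, and pytest.param can give each case a readable id instead of the generated valid_schema0, valid_schema1 labels. A minimal sketch, where is_positive_price is a hypothetical helper standing in for the "price must be positive" rule:

```python
import pytest


def is_positive_price(price: float) -> bool:
    """Hypothetical helper standing in for the price > 0 rule."""
    return price > 0


@pytest.mark.parametrize(
    "price, expected",
    [
        # Each pytest.param is one test run; the id appears in the -v output.
        pytest.param(18.99, True, id="regular-price"),
        pytest.param(0.0, False, id="zero"),
        pytest.param(-10.0, False, id="negative"),
    ],
)
def test_is_positive_price(price: float, expected: bool) -> None:
    assert is_positive_price(price) is expected
```

Readable ids make it easier to spot which case failed when the list of parametrized values grows.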