Loading CSV data¶
csv_load() streams rows from a CSV source into
either plain dictionaries or typed dataclass instances. Rows are produced
lazily, so even very large files are handled a row at a time, and blank lines
are skipped.
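To make "lazily" concrete, here is a minimal standard-library sketch of a generator with the same shape. It is not the implementation of csv_load(); lazy_rows is a hypothetical name used only for illustration.

```python
import csv
from collections.abc import Iterable, Iterator


def lazy_rows(lines: Iterable[str]) -> Iterator[dict[str, str]]:
    """Yield one dict per data row, skipping blank lines (hypothetical helper)."""
    # Drop blank lines before the csv module sees them, so the header is
    # always the first non-blank line.
    non_blank = (line for line in lines if line.strip())
    # DictReader pulls from its input one line at a time, so nothing is
    # read ahead of the row currently being yielded.
    yield from csv.DictReader(non_blank)


rows = lazy_rows(["name,score", "", "Ada,95", "Linus,88"])
first = next(rows)  # {'name': 'Ada', 'score': '95'}; 'Linus' is not parsed yet
```

Because nothing is read until a row is requested, memory use stays bounded by one row regardless of file size.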
Reading from different sources¶
The source can be anything in TextProvider: a
single string, a Path, an already-open text stream, or a list
of lines. All four behave identically:
from io import StringIO
from pathlib import Path
from sevaht_utility.parsing import csv_load
csv_load("name,score\nAda,95") # one string
csv_load(["name,score", "Ada,95"]) # list of lines
csv_load(Path("people.csv")) # a file path
csv_load(StringIO("name,score\nAda,95")) # an open stream
Each returns a lazy iterator; wrap it in list(...) to materialize, or iterate
it directly. The examples below use lists of lines for brevity.
Dictionaries¶
With no dataclass, the first row is the header and each later row becomes a
dict[str, str] (values are left as strings):
list(csv_load(["name,score", "Ada,95", "Linus,88"]))
# [{'name': 'Ada', 'score': '95'}, {'name': 'Linus', 'score': '88'}]
Dataclasses with typed fields¶
Pass a dataclass and each row becomes an instance of it, with every
cell converted to the annotated type of its field:
from dataclasses import dataclass
@dataclass
class Person:
    name: str
    score: int
list(csv_load(["name,score", "Ada,95"], dataclass=Person))
# [Person(name='Ada', score=95)]
Converting cell values¶
Built-in types and unions¶
str, int, float, and bool are converted out of the box. A union is tried
left to right, so annotate a column that holds mixed data accordingly. With
int | str, numeric cells become int and the rest stay str:
@dataclass
class Item:
    id: int | str
    quantity: int
list(csv_load(["id,quantity", "7,3", "abc,5"], dataclass=Item))
# [Item(id=7, quantity=3), Item(id='abc', quantity=5)]
Booleans¶
A bool field accepts 1, true, or yes (case-insensitively) as true;
everything else is false (see parse_bool()):
@dataclass
class Flag:
    name: str
    enabled: bool
list(csv_load(["name,enabled", "wifi,Yes", "bt,0"], dataclass=Flag))
# [Flag(name='wifi', enabled=True), Flag(name='bt', enabled=False)]
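A dedicated rule is needed because Python's own bool() treats every non-empty string as true, so bool("0") is True. The stand-in below applies the rule exactly as stated above; parse_bool_sketch is a hypothetical name, and details such as whitespace handling in the real parse_bool() are not assumed.

```python
TRUE_WORDS = frozenset({"1", "true", "yes"})


def parse_bool_sketch(text: str) -> bool:
    """Return True for 1, true, or yes in any letter case, otherwise False."""
    return text.lower() in TRUE_WORDS


parse_bool_sketch("Yes")  # True
parse_bool_sketch("0")    # False
bool("0")                 # True, which is why a plain bool() call would be wrong
```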
Custom types with from_string¶
Give a type a from_string classmethod and it is used automatically to convert
that field’s cells:
@dataclass
class Temperature:
    celsius: float
    @classmethod
    def from_string(cls, value: str) -> "Temperature":
        return cls(float(value.removesuffix("C")))
@dataclass
class Reading:
    label: str
    temp: Temperature
list(csv_load(["label,temp", "noon,21.5C"], dataclass=Reading))
# [Reading(label='noon', temp=Temperature(celsius=21.5))]
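"Used automatically" can be pictured as a lookup for the classmethod before falling back to the type's constructor. The sketch below assumes that model; convert_cell is a hypothetical helper and does not describe the library's actual dispatch.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Temperature:
    celsius: float

    @classmethod
    def from_string(cls, value: str) -> "Temperature":
        return cls(float(value.removesuffix("C")))


def convert_cell(text: str, target: type) -> Any:
    """Prefer target.from_string when it exists, else call the type itself."""
    hook = getattr(target, "from_string", None)
    if callable(hook):
        return hook(text)
    return target(text)


convert_cell("21.5C", Temperature)  # Temperature(celsius=21.5)
convert_cell("3", int)              # 3
```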
Registering a converter for a type you do not own¶
When you cannot add from_string to a type (it is third-party, or you want
different behavior per load), register a converter on a
StringParser and pass it via
CsvLoadOptions:
from sevaht_utility.parsing import CsvLoadOptions, StringParser
parser = StringParser()
parser.set_converter(complex, converter=complex) # built-in complex()
@dataclass
class Signal:
    name: str
    value: complex
list(csv_load(["name,value", "a,1+2j"], dataclass=Signal,
              options=CsvLoadOptions(string_parser=parser)))
# [Signal(name='a', value=(1+2j))]
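The benefit of keeping converters on a parser object, rather than on the type, is that two loads can parse the same type differently. The sketch below illustrates the idea with a plain dictionary; ConverterRegistry is a hypothetical stand-in and says nothing about how StringParser is implemented.

```python
from collections.abc import Callable
from typing import Any


class ConverterRegistry:
    """Hypothetical stand-in: maps a type to the callable that parses it."""

    def __init__(self) -> None:
        self._converters: dict[type, Callable[[str], Any]] = {}

    def register(self, target: type, converter: Callable[[str], Any]) -> None:
        self._converters[target] = converter

    def parse(self, text: str, target: type) -> Any:
        # A registered converter wins; otherwise fall back to the constructor.
        converter = self._converters.get(target, target)
        return converter(text)


default = ConverterRegistry()
decimal_comma = ConverterRegistry()
decimal_comma.register(float, lambda text: float(text.replace(",", ".")))

default.parse("1+2j", complex)      # (1+2j)
decimal_comma.parse("3,5", float)   # 3.5
```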
Computing fields per row¶
Derived fields with InitVar and __post_init__¶
Use InitVar for cells that feed __post_init__ but are not stored directly,
and field(init=False, ...) for values computed from them. Each InitVar is
matched to a column like any other field:
from dataclasses import dataclass, field, InitVar
@dataclass
class Scores:
    name: str
    total: int = field(init=False, default=0)
    first: InitVar[int] = 0
    second: InitVar[int] = 0
    def __post_init__(self, first: int, second: int) -> None:
        self.total = first + second
list(csv_load(["name,first,second", "Ada,3,4"], dataclass=Scores))
# [Scores(name='Ada', total=7)]
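The split between stored fields and constructor-only inputs is standard dataclasses behavior and can be inspected without any CSV involved. The snippet below shows only that behavior; how csv_load() discovers the InitVar columns internally is not assumed.

```python
import inspect
from dataclasses import InitVar, dataclass, field, fields


@dataclass
class Scores:
    name: str
    total: int = field(init=False, default=0)
    first: InitVar[int] = 0
    second: InitVar[int] = 0

    def __post_init__(self, first: int, second: int) -> None:
        self.total = first + second


# fields() reports only what is stored on the instance.
stored = [f.name for f in fields(Scores)]  # ['name', 'total']
# The constructor signature is where the InitVar parameters appear.
accepted = list(inspect.signature(Scores).parameters)  # ['name', 'first', 'second']

scores = Scores(name="Ada", first=3, second=4)  # scores.total == 7
```

This is why first and second need columns while total does not: total is not a constructor parameter at all.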
A custom factory with init_function¶
Supply init_function to build each row yourself instead of calling the
constructor directly. Its parameter names are matched to columns and its
annotations drive conversion. Pair it with dataclass= so the result is typed:
@dataclass
class Person:
    name: str
    score: int
def make_person(name: str, score: int) -> Person:
    return Person(name=name.title(), score=score + 100)  # bonus points
list(csv_load(["name,score", "ada,95"], dataclass=Person,
              init_function=make_person))
# [Person(name='Ada', score=195)]
In dict mode (no dataclass), an init_function returns a dict; its annotated
parameters still drive conversion, which lets you add derived keys:
def to_record(name: str, score: int) -> dict[str, object]:
    return {"name": name, "score": score, "passed": score >= 50}
list(csv_load(["name,score", "Ada,40"], init_function=to_record))
# [{'name': 'Ada', 'score': 40, 'passed': False}]
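Matching parameter names to columns and converting by annotation can be done with inspect.signature. The following is a rough sketch of that mechanism under simplifying assumptions (annotations are plain callables such as int); call_with_row is a hypothetical helper, not part of the library.

```python
import inspect
from collections.abc import Callable
from typing import Any


def call_with_row(function: Callable[..., Any], row: dict[str, str]) -> Any:
    """Call function with the cells its parameters name, converted by annotation."""
    kwargs = {}
    for name, parameter in inspect.signature(function).parameters.items():
        if name not in row:
            continue  # leave the parameter to its default, if it has one
        annotation = parameter.annotation
        # Unannotated or non-callable annotations leave the cell as a string.
        if annotation is inspect.Parameter.empty or not callable(annotation):
            annotation = str
        kwargs[name] = annotation(row[name])
    return function(**kwargs)


def to_record(name: str, score: int) -> dict[str, object]:
    return {"name": name, "score": score, "passed": score >= 50}


call_with_row(to_record, {"name": "Ada", "score": "40"})
# {'name': 'Ada', 'score': 40, 'passed': False}
```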
Default values for absent columns¶
Fields with defaults need not appear in the CSV; missing columns simply keep their default:
@dataclass
class Config:
    host: str
    port: int = 8080
list(csv_load(["host", "example.com"], dataclass=Config))
# [Config(host='example.com', port=8080)]
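A simple way to get this behavior is to pass the constructor only the fields whose column is present, so the dataclass applies its own defaults to the rest. The sketch below assumes that approach; build is a hypothetical helper that handles only fields annotated with plain callable types.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Any


@dataclass
class Config:
    host: str
    port: int = 8080


def build(cls: type, row: dict[str, str]) -> Any:
    """Construct cls from a row, omitting fields whose column is absent."""
    kwargs = {}
    for f in fields(cls):
        if f.name in row:
            kwargs[f.name] = f.type(row[f.name])  # assumes a plain callable type
        elif f.default is MISSING and f.default_factory is MISSING:
            raise KeyError(f"no column for required field {f.name!r}")
    return cls(**kwargs)


build(Config, {"host": "example.com"})                  # port stays 8080
build(Config, {"host": "example.com", "port": "9000"})  # port becomes 9000
```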
Reader options¶
CsvLoadOptions and
DataMapping cover the how and the what of a
load, respectively.
A different delimiter¶
Set delimiter for tab- or pipe-separated data:
list(csv_load(["a\tb", "1\t2"], options=CsvLoadOptions(delimiter="\t")))
# [{'a': '1', 'b': '2'}]
Data with no header row¶
Provide column_names to name the columns positionally; every row is then
treated as data:
from sevaht_utility.parsing import DataMapping
list(csv_load(["1,2,3", "4,5,6"],
              mapping=DataMapping(column_names=["a", "b", "c"])))
# [{'a': '1', 'b': '2', 'c': '3'}, {'a': '4', 'b': '5', 'c': '6'}]
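For comparison, the standard library offers the same idea through the fieldnames argument of csv.DictReader. This snippet uses only the csv module and is independent of csv_load().

```python
import csv

column_names = ["a", "b", "c"]
lines = ["1,2,3", "4,5,6"]

# With fieldnames supplied, DictReader does not consume a header line.
rows = list(csv.DictReader(lines, fieldnames=column_names))
# [{'a': '1', 'b': '2', 'c': '3'}, {'a': '4', 'b': '5', 'c': '6'}]
```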
Requiring every column to be used¶
By default, columns that match no field are ignored. Set
allow_column_subset=False to raise
UnconsumedColumnsError instead when a column goes
unused:
mapping = DataMapping(field_to_column_name={"x": "a"})
options = CsvLoadOptions(allow_column_subset=False)
list(csv_load(["a,b", "1,2"], mapping=mapping, options=options))
# UnconsumedColumnsError: 1 columns were not consumed: b
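The strict check amounts to comparing the header against the set of columns that some field actually consumed. A minimal sketch of that comparison follows; check_all_consumed is a hypothetical helper, and ValueError stands in for the library's UnconsumedColumnsError.

```python
def check_all_consumed(header: list[str], consumed: set[str]) -> None:
    """Raise if any header column was not used (hypothetical strict check)."""
    unused = [name for name in header if name not in consumed]
    if unused:
        # ValueError stands in for the library's UnconsumedColumnsError.
        raise ValueError(f"columns not consumed: {', '.join(unused)}")


check_all_consumed(["a", "b"], {"a", "b"})  # passes silently
# check_all_consumed(["a", "b"], {"a"}) would raise: columns not consumed: b
```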
Mapping columns to fields for edge-case names¶
By default a column feeds the field with the same name. When the header text
does not line up with your field names, describe the mapping with a
DataMapping. When several rules could apply, the
precedence is, from highest to lowest:
field_to_column_index
field_to_column_name
field / parameter names
dataclass field metadata
raw column names (dict mode)
Differently-cased headers¶
Real exports often use camelCase or PascalCase headers while your fields
are snake_case. Set name_style and both sides are normalized before
matching (see Working with identifier names).
Concretely, both the source column names and the destination field names are
converted to name_style before they are compared. Normalizing the destination
names is the important part: your dataclass can keep idiomatic PEP 8
snake_case members regardless of how the file is cased. You do not rename your
fields to match the header; you name them properly and let the comparison
happen in a common style.
from sevaht_utility.naming import NameStyle
@dataclass
class Person:
    full_name: str
    score_value: int
rows = ["fullName,scoreValue", "Ada,95"]
list(csv_load(rows, dataclass=Person,
              mapping=DataMapping(name_style=NameStyle.CAMEL_CASE)))
# [Person(full_name='Ada', score_value=95)]
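The effect of normalizing both sides can be sketched by reducing every name to its lowercase words and matching on those. This is an illustration of the idea only: the library converts names to the chosen name_style rather than to word tuples, and words and match_fields are hypothetical helpers with a deliberately simple splitting rule.

```python
import re


def words(name: str) -> tuple[str, ...]:
    """Split a snake_case, camelCase, or PascalCase name into lowercase words."""
    # Mark a word break wherever a capital follows a lowercase letter or digit.
    marked = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return tuple(part.lower() for part in marked.split("_") if part)


def match_fields(field_names: list[str], header: list[str]) -> dict[str, str]:
    """Map each field name to the header column made of the same words."""
    column_by_words = {words(column): column for column in header}
    return {
        name: column_by_words[words(name)]
        for name in field_names
        if words(name) in column_by_words
    }


match_fields(["full_name", "score_value"], ["fullName", "ScoreValue"])
# {'full_name': 'fullName', 'score_value': 'ScoreValue'}
```

Neither side is renamed; both are compared in a shared form, which is why the dataclass can stay in snake_case.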
A header that differs per field (metadata)¶
When only one field has an awkward header, annotate just that field. By default
the metadata key is csv_key (configurable via
CsvLoadOptions):
@dataclass
class Row:
    identifier: int = field(metadata={"csv_key": "ID"})
    label: str = ""
list(csv_load(["ID,label", "7,hello"], dataclass=Row))
# [Row(identifier=7, label='hello')]
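Field metadata is an ordinary read-only mapping attached to each dataclass field, so the per-field header name can be read back with dataclasses.fields(). The snippet below shows that lookup using the default csv_key; it demonstrates standard dataclasses behavior, not the library's internal code.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Row:
    identifier: int = field(metadata={"csv_key": "ID"})
    label: str = ""


# Fall back to the field's own name when no csv_key is given.
column_for_field = {f.name: f.metadata.get("csv_key", f.name) for f in fields(Row)}
# {'identifier': 'ID', 'label': 'label'}
```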
Arbitrary header text¶
For headers that share nothing with the field names, map them explicitly with
field_to_column_name:
@dataclass
class Account:
    user_id: int
    balance: float
mapping = DataMapping(field_to_column_name={"user_id": "acct#", "balance": "$$$"})
list(csv_load(["acct#,$$$", "42,9.99"], dataclass=Account, mapping=mapping))
# [Account(user_id=42, balance=9.99)]
Duplicate or ambiguous headers¶
When two columns share a name (or normalize to the same name), matching by name
is ambiguous and raises
AmbiguousColumnNamesError. Disambiguate by
pinning fields to explicit zero-based column indices, the highest-precedence
rule:
@dataclass
class Pair:
    first_a: int
    second_a: int
mapping = DataMapping(
    field_to_column_name={"first_a": "a", "second_a": "a"},
    field_to_column_index={"first_a": 0, "second_a": 1},
)
list(csv_load(["a,a,b", "1,2,3"], dataclass=Pair, mapping=mapping))
# [Pair(first_a=1, second_a=2)]
This is the recommended escape hatch for the run-together acronym names that
split_into_words() intentionally leaves merged: map
the column directly rather than relying on the splitter.
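The reason indices resolve the ambiguity is that a position identifies exactly one cell, whereas a repeated name identifies several. A small standard-library sketch of both halves follows; duplicated is a hypothetical helper, and the index mapping mirrors the field_to_column_index argument only in shape.

```python
from collections import Counter


def duplicated(header: list[str]) -> list[str]:
    """Return the header names that occur more than once, in first-seen order."""
    return [name for name, count in Counter(header).items() if count > 1]


header = ["a", "a", "b"]
cells = ["1", "2", "3"]
field_to_column_index = {"first_a": 0, "second_a": 1}

duplicated(header)  # ['a']
# An index picks exactly one cell, so duplicate names no longer matter.
picked = {name: cells[index] for name, index in field_to_column_index.items()}
# {'first_a': '1', 'second_a': '2'}
```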
See the API reference for the full set of options and the exceptions each mismatch raises.