# Loading CSV data

{func}`~sevaht_utility.parsing.csv_load` streams rows from a CSV source into either plain dictionaries or typed dataclass instances. Rows are produced lazily, so even very large files are handled a row at a time, and blank lines are skipped.

## Reading from different sources

The `source` can be anything in {data}`~sevaht_utility.parsing.TextProvider`: a single string, a {class}`~pathlib.Path`, an already-open text stream, or a list of lines. All four behave identically:

```python
from io import StringIO
from pathlib import Path

from sevaht_utility.parsing import csv_load

csv_load("name,score\nAda,95")            # one string
csv_load(["name,score", "Ada,95"])        # list of lines
csv_load(Path("people.csv"))              # a file path
csv_load(StringIO("name,score\nAda,95"))  # an open stream
```

Each returns a lazy iterator; wrap it in `list(...)` to materialize, or iterate it directly. The examples below use lists of lines for brevity.

## Dictionaries

With no `dataclass`, the first row is the header and each later row becomes a `dict[str, str]` (values are left as strings):

```python
list(csv_load(["name,score", "Ada,95", "Linus,88"]))
# [{'name': 'Ada', 'score': '95'}, {'name': 'Linus', 'score': '88'}]
```

## Dataclasses with typed fields

Pass a `dataclass` and each row is constructed into an instance, with every cell converted to the field's annotated type:

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    score: int

list(csv_load(["name,score", "Ada,95"], dataclass=Person))
# [Person(name='Ada', score=95)]
```

## Converting cell values

### Built-in types and unions

`str`, `int`, `float`, and `bool` are converted out of the box. A union is tried left to right, so annotate a column that holds mixed data accordingly.
With `int | str`, numeric cells become `int` and the rest stay `str`:

```python
@dataclass
class Item:
    id: int | str
    quantity: int

list(csv_load(["id,quantity", "7,3", "abc,5"], dataclass=Item))
# [Item(id=7, quantity=3), Item(id='abc', quantity=5)]
```

### Booleans

A `bool` field accepts `1`, `true`, or `yes` (case-insensitively) as true; everything else is false (see {func}`~sevaht_utility.parsing.parse_bool`):

```python
@dataclass
class Flag:
    name: str
    enabled: bool

list(csv_load(["name,enabled", "wifi,Yes", "bt,0"], dataclass=Flag))
# [Flag(name='wifi', enabled=True), Flag(name='bt', enabled=False)]
```

### Custom types with `from_string`

Give a type a `from_string` classmethod and it is used automatically to convert that field's cells:

```python
@dataclass
class Temperature:
    celsius: float

    @classmethod
    def from_string(cls, value: str) -> "Temperature":
        return cls(float(value.removesuffix("C")))

@dataclass
class Reading:
    label: str
    temp: Temperature

list(csv_load(["label,temp", "noon,21.5C"], dataclass=Reading))
# [Reading(label='noon', temp=Temperature(celsius=21.5))]
```

### Registering a converter for a type you do not own

When you cannot add `from_string` to a type (it is third-party, or you want different behavior per load), register a converter on a {class}`~sevaht_utility.parsing.StringParser` and pass it via {class}`~sevaht_utility.parsing.CsvLoadOptions`:

```python
from sevaht_utility.parsing import CsvLoadOptions, StringParser

parser = StringParser()
parser.set_converter(complex, converter=complex)  # built-in complex()

@dataclass
class Signal:
    name: str
    value: complex

list(csv_load(["name,value", "a,1+2j"], dataclass=Signal,
              options=CsvLoadOptions(string_parser=parser)))
# [Signal(name='a', value=(1+2j))]
```

## Computing fields per row

### Derived fields with `InitVar` and `__post_init__`

Use `InitVar` for cells that feed `__post_init__` but are not stored directly, and `field(init=False, ...)` for values computed from them.
Each `InitVar` is matched to a column like any other field:

```python
from dataclasses import dataclass, field, InitVar

@dataclass
class Scores:
    name: str
    total: int = field(init=False, default=0)
    first: InitVar[int] = 0
    second: InitVar[int] = 0

    def __post_init__(self, first: int, second: int) -> None:
        self.total = first + second

list(csv_load(["name,first,second", "Ada,3,4"], dataclass=Scores))
# [Scores(name='Ada', total=7)]
```

### A custom factory with `init_function`

Supply `init_function` to build each row yourself instead of calling the constructor directly. Its parameter names are matched to columns and its annotations drive conversion. Pair it with `dataclass=` so the result is typed:

```python
@dataclass
class Person:
    name: str
    score: int

def make_person(name: str, score: int) -> Person:
    return Person(name=name.title(), score=score + 100)  # bonus points

list(csv_load(["name,score", "ada,95"], dataclass=Person, init_function=make_person))
# [Person(name='Ada', score=195)]
```

In dict mode (no `dataclass`), an `init_function` returns a dict; its annotated parameters still drive conversion, which lets you add derived keys:

```python
def to_record(name: str, score: int) -> dict[str, object]:
    return {"name": name, "score": score, "passed": score >= 50}

list(csv_load(["name,score", "Ada,40"], init_function=to_record))
# [{'name': 'Ada', 'score': 40, 'passed': False}]
```

### Default values for absent columns

Fields with defaults need not appear in the CSV; missing columns simply keep their default:

```python
@dataclass
class Config:
    host: str
    port: int = 8080

list(csv_load(["host", "example.com"], dataclass=Config))
# [Config(host='example.com', port=8080)]
```

## Reader options

{class}`~sevaht_utility.parsing.CsvLoadOptions` and {class}`~sevaht_utility.parsing.DataMapping` cover the *how* and the *what* of a load, respectively.

### A different delimiter

Set `delimiter` for tab- or pipe-separated data:

```python
list(csv_load(["a\tb", "1\t2"], options=CsvLoadOptions(delimiter="\t")))
# [{'a': '1', 'b': '2'}]
```

### Data with no header row

Provide `column_names` to name the columns positionally; every row is then treated as data:

```python
from sevaht_utility.parsing import DataMapping

list(csv_load(["1,2,3", "4,5,6"], mapping=DataMapping(column_names=["a", "b", "c"])))
# [{'a': '1', 'b': '2', 'c': '3'}, {'a': '4', 'b': '5', 'c': '6'}]
```

### Requiring every column to be used

By default, columns that match no field are ignored. Set `allow_column_subset=False` to raise {exc}`~sevaht_utility.parsing.UnconsumedColumnsError` instead when a column goes unused:

```python
mapping = DataMapping(field_to_column_name={"x": "a"})
options = CsvLoadOptions(allow_column_subset=False)

list(csv_load(["a,b", "1,2"], mapping=mapping, options=options))
# UnconsumedColumnsError: 1 columns were not consumed: b
```

(edge-case-names)=
## Mapping columns to fields for edge-case names

By default, a column feeds the field with the same name. When the header text does not line up with your field names, describe the mapping with a {class}`~sevaht_utility.parsing.DataMapping`. When several rules could apply, the precedence is, highest first:

1. `field_to_column_index`
2. `field_to_column_name`
3. field / parameter names
4. dataclass field metadata
5. raw column names (dict mode)

### Differently-cased headers

Real exports often use `camelCase` or `PascalCase` headers while your fields are `snake_case`. Set `name_style` and both sides, the source columns *and* the destination field names, are converted to that style before they are compared (see {doc}`naming`). Normalizing the destination names is the important part: it means your dataclass can keep idiomatic PEP 8 `snake_case` members regardless of how the file is cased.
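
As a rough illustration of comparing both sides in a common style, the following standard-library sketch converts column and field names to `camelCase` before matching them. The helpers `to_camel_case` and `match_fields` are hypothetical names invented for this sketch; they are not part of `sevaht_utility`, and the library's own word splitting may well differ.

```python
import re


def to_camel_case(name: str) -> str:
    """Normalize snake_case, camelCase, or PascalCase to camelCase (illustrative only)."""
    # Mark lower/digit-to-upper boundaries with "_" so every style splits the same way.
    marked = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    words = [word for word in marked.split("_") if word]
    if not words:
        return ""
    return words[0].lower() + "".join(word.capitalize() for word in words[1:])


def match_fields(columns: list[str], fields: list[str]) -> dict[str, str]:
    """Map each field to the column whose normalized name equals its own."""
    by_normalized = {to_camel_case(column): column for column in columns}
    return {
        field: by_normalized[to_camel_case(field)]
        for field in fields
        if to_camel_case(field) in by_normalized
    }


print(match_fields(["fullName", "scoreValue"], ["full_name", "score_value"]))
# {'full_name': 'fullName', 'score_value': 'scoreValue'}
```

Because both names pass through the same function, the field list can stay in `snake_case` while the header stays in whatever style the export uses.
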
You do not rename your fields to match the header; you name them properly and let the comparison happen in a common style.

```python
from sevaht_utility.naming import NameStyle

@dataclass
class Person:
    full_name: str
    score_value: int

rows = ["fullName,scoreValue", "Ada,95"]
list(csv_load(rows, dataclass=Person, mapping=DataMapping(name_style=NameStyle.CAMEL_CASE)))
# [Person(full_name='Ada', score_value=95)]
```

### A header that differs per field (metadata)

When only one field has an awkward header, annotate just that field. By default, the metadata key is `csv_key` (configurable via {class}`~sevaht_utility.parsing.CsvLoadOptions`):

```python
@dataclass
class Row:
    identifier: int = field(metadata={"csv_key": "ID"})
    label: str = ""

list(csv_load(["ID,label", "7,hello"], dataclass=Row))
# [Row(identifier=7, label='hello')]
```

### Arbitrary header text

For headers that share nothing with the field names, map them explicitly with `field_to_column_name`:

```python
@dataclass
class Account:
    user_id: int
    balance: float

mapping = DataMapping(field_to_column_name={"user_id": "acct#", "balance": "$$$"})
list(csv_load(["acct#,$$$", "42,9.99"], dataclass=Account, mapping=mapping))
# [Account(user_id=42, balance=9.99)]
```

### Duplicate or ambiguous headers

When two columns share a name (or normalize to the same name), matching by name is ambiguous and raises {exc}`~sevaht_utility.parsing.AmbiguousColumnNamesError`.
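
The ambiguity is easy to reproduce with the standard library alone. The sketch below does not use `sevaht_utility`; `duplicated_names` is a hypothetical helper that shows how a duplicate header can be detected, and what a naive name-keyed lookup would silently lose.

```python
from collections import Counter


def duplicated_names(header: list[str]) -> list[str]:
    """Return header names that occur more than once, in first-seen order."""
    counts = Counter(header)
    return [name for name, count in counts.items() if count > 1]


header = ["a", "a", "b"]
row = ["1", "2", "3"]

print(duplicated_names(header))  # ['a']
print(dict(zip(header, row)))    # {'a': '2', 'b': '3'}; the first 'a' cell is overwritten
print(row[0], row[1])            # 1 2; positions stay unambiguous
```

Keying by name cannot say which `a` was meant, whereas a column position always identifies exactly one cell.
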
Disambiguate by pinning fields to explicit zero-based column indices, the highest-precedence rule:

```python
@dataclass
class Pair:
    first_a: int
    second_a: int

mapping = DataMapping(
    field_to_column_name={"first_a": "a", "second_a": "a"},
    field_to_column_index={"first_a": 0, "second_a": 1},
)
list(csv_load(["a,a,b", "1,2,3"], dataclass=Pair, mapping=mapping))
# [Pair(first_a=1, second_a=2)]
```

This is the recommended escape hatch for the run-together acronym names that {func}`~sevaht_utility.naming.split_into_words` intentionally leaves merged: map the column directly rather than relying on the splitter.

See the {doc}`API reference <../reference/parsing>` for the full set of options and the exceptions each mismatch raises.