Write

Drop three files in a temp directory, build a DataFrame with name, path, and any extra columns you want to carry along (here split, label, and a GeoParquet geometry column), then hand it to Cozip.create.

using DataFrames
using JSON3
using ArchGDAL
using Cozip
const AG = ArchGDAL

# three temporary files with arbitrary contents
tmp = mktempdir()
paths = String[]
for i in 0:2
    p = joinpath(tmp, "file_$(i).bin")
    write(p, repeat("file $i contents\n", 100))
    push!(paths, p)
end

# GeoParquet metadata so viewers recognize the geometry column
geo = Dict(
    "version"        => "1.1.0",
    "primary_column" => "geometry",
    "columns"        => Dict("geometry" => Dict("encoding" => "WKB", "geometry_types" => ["Polygon"])),
)

# bounding box polygon as WKB
function bbox_wkb(xmin, ymin, xmax, ymax)
    poly = AG.createpolygon([(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)])
    AG.toWKB(poly)
end

df = DataFrame(
    name     = basename.(paths),
    path     = paths,
    split    = ["train", "val", "train"],
    label    = ["zeros", "ones", "twos"],
    geometry = [
        bbox_wkb(-77.0, -12.1, -76.9, -12.0),
        bbox_wkb(-76.9, -12.1, -76.8, -12.0),
        bbox_wkb(-76.8, -12.1, -76.7, -12.0),
    ],
)
metadata!(df, "geo", JSON3.write(geo); style=:note)

archive = joinpath(tmp, "dataset.zip")
Cozip.create(archive, df)

Read

Cozip.read returns the manifest as a DataFrame, your custom columns included. Filter however you like, then read the matching files in place: seek to each row's offset and read size bytes, using the offset and size columns the writer added.

using Cozip

df = Cozip.read(archive)
println(df)

# filter on your own columns, then read the matching files in place
trains = df[df.split .== "train", :]
open(archive, "r") do f
    for row in eachrow(trains)
        seek(f, row.offset)
        data = read(f, row.size)
        println(row.name, " ", length(data), " bytes")
    end
end
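
The seek-and-read pattern above can be wrapped in a small helper. This is only a sketch: read_span is a hypothetical name, not part of Cozip, and it assumes the offset is a zero-based byte position (which is what seek expects) and that the entry's bytes are stored uncompressed, so they can be used as read. It runs against a stand-in file rather than a real archive.

```julia
# Hypothetical helper (not part of Cozip): read `nbytes` bytes starting at the
# zero-based byte `offset` of the file at `path`.
function read_span(path::AbstractString, offset::Integer, nbytes::Integer)
    offset >= 0 || throw(ArgumentError("offset must be non-negative"))
    nbytes >= 0 || throw(ArgumentError("nbytes must be non-negative"))
    open(path, "r") do io
        seek(io, offset)
        data = read(io, nbytes)
        # read(io, n) returns fewer bytes at end of file, so check the length.
        length(data) == nbytes || error("expected $nbytes bytes, got $(length(data))")
        data
    end
end

# Stand-in file so the sketch runs without an archive.
demo = joinpath(mktempdir(), "demo.bin")
write(demo, "headerPAYLOADtrailer")
payload = String(read_span(demo, 6, 7))
println(payload)  # prints PAYLOAD
```

With a manifest row in hand, the call would look like read_span(archive, row.offset, row.size). The length check turns a truncated archive or a wrong offset into an error instead of a silently short read.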

Publish

The archive is a plain ZIP, so any S3-compatible bucket works. The example below uses asterisk-labs/cozip on Source Cooperative.

# Source Coop issues temporary STS credentials. Export them in your shell, then upload.
export AWS_ACCESS_KEY_ID="<your-key>"
export AWS_SECRET_ACCESS_KEY="<your-secret>"
export AWS_SESSION_TOKEN="<your-session-token>"

aws s3 cp dataset.zip \
    s3://us-west-2.opendata.source.coop/asterisk-labs/cozip/dataset.zip \
    --region us-west-2

Back in Julia, the read flow is the same with a URL in place of the path.

using Cozip

url = "https://data.source.coop/asterisk-labs/cozip/dataset.zip"
println(Cozip.read(url))
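
When only a few files are needed from a published archive, the same offset and size could drive an HTTP range request instead of a local seek. The sketch below is an illustration under stated assumptions, not Cozip's own API: range_header and fetch_span are hypothetical helpers, the request goes through Julia's Downloads standard library, and it assumes the server honors Range headers and that entries are stored uncompressed.

```julia
using Downloads

# Hypothetical helper: HTTP Range value for `nbytes` bytes at zero-based `offset`.
# Both ends of an HTTP byte range are inclusive, hence the -1.
function range_header(offset::Integer, nbytes::Integer)
    offset >= 0 || throw(ArgumentError("offset must be non-negative"))
    nbytes > 0 || throw(ArgumentError("nbytes must be positive"))
    "bytes=$(offset)-$(offset + nbytes - 1)"
end

# Hypothetical helper: fetch one byte span from `url` into memory.
function fetch_span(url::AbstractString, offset::Integer, nbytes::Integer)
    buf = IOBuffer()
    Downloads.download(url, buf; headers = ["Range" => range_header(offset, nbytes)])
    data = take!(buf)
    # A server that ignores Range returns the whole file; catch that here.
    length(data) == nbytes || error("expected $nbytes bytes, got $(length(data))")
    data
end

println(range_header(1024, 100))  # prints bytes=1024-1123
```

Given the manifest from Cozip.read(url), a call such as fetch_span(url, row.offset, row.size) would then pull a single file's bytes without downloading the rest of the archive.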

Explore in the playground

Once the archive is public, the cozip playground reads it straight from the browser. No Julia, no install, no download. The manifest renders as a table, and every file is one click away from a copyable URL.

Open the playground and paste the URL below into the input.

https://data.source.coop/asterisk-labs/cozip/dataset.zip