Quickstart
cozip in Julia.
Pack three files with whatever metadata you want into one cozip archive, read it back, publish it, and open it in the browser.
    ]registry add https://github.com/asterisk-labs/AsteriskRegistry
    ]add Cozip DataFrames JSON3 ArchGDAL

Write
Drop three files in a temp directory, build a DataFrame with name, path, and any extras you want to ride along (here split, label, and a GeoParquet geometry column), then hand it to Cozip.create.
    using DataFrames
    using JSON3
    using ArchGDAL
    using Cozip

    const AG = ArchGDAL

    # three tmp files with anything inside
    tmp = mktempdir()
    paths = String[]
    for i in 0:2
        p = joinpath(tmp, "file_$(i).bin")
        write(p, repeat("file $i contents\n", 100))
        push!(paths, p)
    end

    # GeoParquet metadata so viewers recognize the geometry column
    geo = Dict(
        "version" => "1.1.0",
        "primary_column" => "geometry",
        "columns" => Dict("geometry" => Dict("encoding" => "WKB", "geometry_types" => ["Polygon"])),
    )

    # bounding box polygon as WKB
    function bbox_wkb(xmin, ymin, xmax, ymax)
        poly = AG.createpolygon([(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)])
        AG.toWKB(poly)
    end

    df = DataFrame(
        name = basename.(paths),
        path = paths,
        split = ["train", "val", "train"],
        label = ["zeros", "ones", "twos"],
        geometry = [
            bbox_wkb(-77.0, -12.1, -76.9, -12.0),
            bbox_wkb(-76.9, -12.1, -76.8, -12.0),
            bbox_wkb(-76.8, -12.1, -76.7, -12.0),
        ],
    )
    metadata!(df, "geo", JSON3.write(geo); style=:note)

    archive = joinpath(tmp, "dataset.zip")
    Cozip.create(archive, df)
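Before handing the DataFrame to Cozip.create, it can help to confirm that the table-level "geo" metadata is attached as intended. The sketch below uses only DataFrames and JSON3 on a stand-in table with placeholder values. It shows the DataFrames side only; whether and how Cozip.create carries this metadata into the archive is not demonstrated here.

```julia
using DataFrames
using JSON3

# Stand-in manifest with one custom column; the values are placeholders.
df = DataFrame(name = ["file_0.bin"], split = ["train"])

geo = Dict(
    "version" => "1.1.0",
    "primary_column" => "geometry",
    "columns" => Dict("geometry" => Dict("encoding" => "WKB", "geometry_types" => ["Polygon"])),
)
metadata!(df, "geo", JSON3.write(geo); style=:note)

# Read the table-level metadata back and parse the JSON string.
raw = metadata(df, "geo")
parsed = JSON3.read(raw)

println(parsed.primary_column)
println(parsed.columns.geometry.encoding)
```

Storing the value as a JSON string keeps the metadata a plain String, which is easy to inspect and compare.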
Read
Cozip.read returns the manifest as a DataFrame, your custom columns included. Filter however you like, then open the matching files in place with seek and read using the offset and size the writer added.
    using Cozip

    df = Cozip.read(archive)
    println(df)

    # filter on your own columns, then read the matching files in place
    trains = df[df.split .== "train", :]

    open(archive, "r") do f
        for row in eachrow(trains)
            seek(f, row.offset)
            data = read(f, row.size)
            println(row.name, " ", length(data), " bytes")
        end
    end
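To see why an offset and a size are enough to pull one file out of a larger one, here is a minimal sketch in plain Julia that does not use Cozip at all. The blob file stands in for the archive, the offsets and sizes vectors stand in for the manifest columns, and read_range is a hypothetical helper name introduced for illustration.

```julia
# Stand-in for an archive: three payloads written back to back into one file.
# The (offset, size) pairs play the role of the manifest columns.
tmp = mktempdir()
blob = joinpath(tmp, "blob.bin")
payloads = [repeat("file $i contents\n", 100) for i in 0:2]

offsets = Int[]
sizes = Int[]
open(blob, "w") do f
    for p in payloads
        push!(offsets, position(f))   # zero-based, as seek expects
        push!(sizes, write(f, p))     # write returns the number of bytes written
    end
end

# Hypothetical helper: read `nbytes` bytes starting at zero-based `offset`.
function read_range(path, offset, nbytes)
    open(path, "r") do f
        seek(f, offset)
        read(f, nbytes)
    end
end

second = read_range(blob, offsets[2], sizes[2])
println(String(second) == payloads[2])
```

Only the requested bytes are read, so the cost of fetching one entry does not depend on how large the surrounding file is.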
Publish
The archive is a plain ZIP so any S3-compatible bucket works. The example below uses asterisk-labs/cozip on Source Cooperative.
    # Source Coop creds are temporary STS. Export them in your shell, then upload.
    export AWS_ACCESS_KEY_ID="<your-key>"
    export AWS_SECRET_ACCESS_KEY="<your-secret>"
    export AWS_SESSION_TOKEN="<your-session-token>"

    aws s3 cp dataset.zip \
      s3://us-west-2.opendata.source.coop/asterisk-labs/cozip/dataset.zip \
      --region us-west-2
Back in Julia, the read flow is the same with a URL in place of the path.
    using Cozip

    url = "https://data.source.coop/asterisk-labs/cozip/dataset.zip"
    println(Cozip.read(url))
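The local seek-and-read pattern has a remote counterpart: an HTTP Range request. The sketch below uses the Downloads standard library and assumes two things the text above does not confirm: that the manifest returned for a URL carries the same offset and size columns as the local one, and that the host honors Range requests. range_header and fetch_range are hypothetical helper names, not part of Cozip.

```julia
using Downloads

# Hypothetical helper: HTTP Range header value for `nbytes` bytes starting at
# zero-based `offset`. Range ends are inclusive, hence the -1.
range_header(offset, nbytes) = "bytes=$(offset)-$(offset + nbytes - 1)"

# Hypothetical helper: fetch one byte range of a remote file into memory.
# Assumes the server honors Range requests; a server that ignores the header
# would return the whole file instead.
function fetch_range(url, offset, nbytes)
    io = IOBuffer()
    Downloads.download(url, io; headers = ["Range" => range_header(offset, nbytes)])
    take!(io)
end

println(range_header(0, 1600))   # prints bytes=0-1599

# Usage with a manifest row (assumed columns: offset, size):
# data = fetch_range(url, row.offset, row.size)
```

Checking that the returned byte count equals the requested size is a cheap way to detect a server that ignored the Range header.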
Explore in the playground
Once the archive is public the cozip playground reads it from the browser. No Julia, no install, no download. The manifest renders as a table, and any file is one click from a copyable URL.
Open the playground and paste the URL below into the input.
https://data.source.coop/asterisk-labs/cozip/dataset.zip