Write

Cozip.create takes a DataFrame (or Arrow.Table) with name and path columns. Any additional columns are carried along as user-defined manifest columns.

using Cozip
using DataFrames

tmp = mktempdir()
paths = String[]
for i in 0:2
    p = joinpath(tmp, "file_$(lpad(i, 4, '0')).bin")
    write(p, repeat("hello cozip\n", 1000))
    push!(paths, p)
end

tbl = DataFrame(
    name = basename.(paths),
    path = paths,
)

Cozip.create("dataset.zip", tbl)
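
To illustrate the user-defined columns mentioned above, here is a minimal sketch that builds the same table with two extra columns. The column names `split` and `n_bytes` are made up for this example and are not required or reserved by Cozip. Only the table construction is shown; the `Cozip.create` call stays as in the example above.

```julia
using DataFrames

# Create three small files in a temporary directory, as in the example above.
tmp = mktempdir()
paths = [joinpath(tmp, "file_$(lpad(i, 4, '0')).bin") for i in 0:2]
foreach(p -> write(p, repeat("hello cozip\n", 1000)), paths)

# `name` and `path` are the required columns. `split` and `n_bytes` are
# hypothetical user-defined columns chosen only for this sketch.
tbl = DataFrame(
    name = basename.(paths),
    path = paths,
    split = ["train", "train", "test"],
    n_bytes = filesize.(paths),
)

# Then write the archive exactly as before:
# Cozip.create("dataset.zip", tbl)
```

Any column type that a DataFrame can hold may be used here; whether Cozip restricts the types it can store in the manifest is not covered by this sketch.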

Read

Cozip.read works on a local path or a remote URL with the same call. It returns the manifest as a DataFrame with an injected cozip:gdal_vsi column, whose paths can be passed directly to ArchGDAL or Rasters.jl.

using Cozip

# local archive
df = Cozip.read("dataset.zip")

# or a remote archive: no full download, just two range requests under the hood
df = Cozip.read(
    "https://huggingface.co/datasets/Major-TOM/Core-VIIRS-Nighttime-Light/" *
    "resolve/main/2024/MAJORTOM-VIIRS-NTL_2024_median_000.zip"
)

# hand the cozip:gdal_vsi path straight to ArchGDAL
using ArchGDAL
dataset = ArchGDAL.read(df[1, "cozip:gdal_vsi"])
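
Because the result is an ordinary DataFrame, the manifest can be filtered before anything is opened. The sketch below assumes the returned table keeps the `name` column alongside `cozip:gdal_vsi`; the helper `vsi_paths` and the stand-in rows are hypothetical and only serve to show the indexing.

```julia
using DataFrames

# Return the cozip:gdal_vsi paths of all manifest rows whose name ends in `suffix`.
# Hypothetical helper, not part of Cozip.
function vsi_paths(df::AbstractDataFrame; suffix::AbstractString = ".tif")
    keep = endswith.(df.name, suffix)
    return df[keep, "cozip:gdal_vsi"]
end

# Stand-in manifest with placeholder values, used only to exercise the helper.
# Real values come from Cozip.read.
demo = DataFrame(
    "name" => ["a.tif", "b.json", "c.tif"],
    "cozip:gdal_vsi" => ["<vsi path a>", "<vsi path b>", "<vsi path c>"],
)

tifs = vsi_paths(demo)
```

With a real manifest, each element of the returned vector can be handed to `ArchGDAL.read` in the same way as `df[1, "cozip:gdal_vsi"]` above.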

Query with DuckDB

The cozip community extension reads the same archive over SQL. Pair it with the DuckDB.jl package, or run it straight from the DuckDB CLI.

-- install once; LOAD is needed again in each new session
INSTALL cozip FROM community;
LOAD cozip;

-- hello world, first 10 entries of the manifest
SELECT *
FROM read_cozip('https://huggingface.co/datasets/Major-TOM/Core-VIIRS-Nighttime-Light/resolve/main/2024/MAJORTOM-VIIRS-NTL_2024_median_000.zip')
LIMIT 10;

-- raw manifest, without the injected /vsisubfile/ column
SELECT *
FROM read_cozip(
    'https://huggingface.co/datasets/Major-TOM/Core-VIIRS-Nighttime-Light/resolve/main/2024/MAJORTOM-VIIRS-NTL_2024_median_000.zip',
    gdal_vsi := false
)
LIMIT 10;

-- filter the manifest, keep the /vsisubfile/ paths for the biggest tifs
SELECT name, "cozip:gdal_vsi", size
FROM read_cozip('https://huggingface.co/datasets/Major-TOM/Core-VIIRS-Nighttime-Light/resolve/main/2024/MAJORTOM-VIIRS-NTL_2024_median_000.zip')
WHERE name LIKE '%.tif'
ORDER BY size DESC
LIMIT 5;
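
The same queries can be issued from Julia through DuckDB.jl. The following is a rough sketch rather than a reference implementation: `manifest_query` and `read_manifest` are hypothetical helper names, and the sketch assumes the cozip extension installs and loads from an in-memory database exactly as in the SQL above.

```julia
using DuckDB
using DataFrames

# Build the manifest query for one archive. Single quotes in the URL are
# doubled so the value stays a valid SQL string literal.
function manifest_query(url::AbstractString; limit::Integer = 10)
    escaped = replace(url, "'" => "''")
    return "SELECT * FROM read_cozip('$escaped') LIMIT $limit"
end

# Hypothetical helper: open an in-memory DuckDB database, load the cozip
# extension, and return the first `limit` manifest rows as a DataFrame.
function read_manifest(url::AbstractString; limit::Integer = 10)
    con = DBInterface.connect(DuckDB.DB, ":memory:")
    try
        DBInterface.execute(con, "INSTALL cozip FROM community")
        DBInterface.execute(con, "LOAD cozip")
        result = DBInterface.execute(con, manifest_query(url; limit = limit))
        return DataFrame(result)  # materialize before the connection closes
    finally
        DBInterface.close!(con)
    end
end
```

Calling `read_manifest` with the archive URL used above should give the same rows as the first SQL query, provided the extension is reachable from the machine running Julia.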