image-dataset-converter release

A new release of our image-dataset-converter-all library is now available: 0.0.12. Docker images have been deployed as well.

The most notable changes since 0.0.11 are:

  • dropped numpy<2.0.0 restriction

  • added grayscale-to-binary filter

  • fix: sort-pixels, rgb-to-grayscale filters

  • the rename filter now supports lower/upper case placeholders for name and extension as well

  • now requires seppl>=0.2.17 for skippable plugin support and to avoid deprecated use of pkg_resources

  • added any-to-rgb filter for turning binary/grayscale images back into RGB ones

  • added label-to-metadata filter for transferring labels into meta-data

  • added metadata-to-placeholder filter for transferring meta-data into placeholders

  • added basic support for images with associated depth information: DepthData, DepthInformation

  • added depth-to-grayscale filter for converting depth information to a grayscale image

  • added depth information readers from-grayscale-dp, from-numpy-dp, from-csv-dp and from-pfm-dp

  • added depth information writers to-grayscale-dp, to-numpy-dp, to-csv-dp and to-pfm-dp

  • added apply-ext-mask filter for applying external PNG masks to image containers (image and/or annotations)

  • added apply-label-mask filter for applying image segmentation label masks to their base images

  • added label-present-ic and label-present-is filters that ensure certain label(s) are present, discarding the image otherwise

  • the label-present filter was renamed to label-present-od, with label-present kept as an alias for the time being

  • fix: imgseg_to_bluechannel, imgseg_to_indexedpng and imgseg_to_grayscale now handle overlapping pixels correctly, no longer adding them up and introducing additional labels

  • the discard-by-name filter can now also use the names of files in specified paths

  • fixed the construction of the error messages in the pyfunc reader/filter/writer classes
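
As a quick sketch of how the new depth plugins fit into a pipeline, the following converts grayscale-encoded depth maps to PFM files; the plugin names are taken from the changelog above, but the -i/-o options and glob syntax are assumptions:

    idc-convert \
        from-grayscale-dp -i "./depth_png/*.png" \
        to-pfm-dp -o ./depth_pfm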

llm-dataset-converter release

Version 0.2.7 of our llm_dataset_converter library has been released. New releases of ldc_doc, ldc_docx, ldc_faster_whisper, ldc_google, ldc_openai, ldc_pdf and ldc_tint have been made available as well.

The meta-library that combines all the libraries now stands at version 0.0.6:

llm-dataset-converter-all

A new Docker image is available as well:

https://hub.docker.com/r/waikatodatamining/llm-dataset-converter/tags

This release is mostly a maintenance release, but it still includes some useful additions:

  • added set-placeholder filter for dynamically setting (temporary) placeholders at runtime

  • added remove-strings filter that just removes sub-strings

  • added strip-strings filter for stripping whitespaces from start/end of strings
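
As a rough sketch of chaining the new filters on the command-line (the llm-convert entry point and the from-txt/to-txt plugins with their -i/-o options are assumptions; only the filter names come from the list above):

    llm-convert \
        from-txt -i "./raw/*.txt" \
        strip-strings \
        to-txt -o ./clean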

audio-dataset-converter release

A new release of our audio-dataset-converter library and its various dependent libraries is out.

The meta-library that combines all the libraries now stands at version 0.0.3:

audio-dataset-converter-all

A new Docker image is available as well:

https://hub.docker.com/r/waikatodatamining/audio-dataset-converter/tags

Notable changes:

  • improved support for placeholders via the set-placeholder and metadata-to-placeholder filters

  • added from-multi and to-multi for combining multiple readers/writers

  • added the --resume_from option to readers to allow resuming the processing from a specific file

  • added the --split_group option to writers: a regular expression with a single group used for keeping items in the same split, e.g., for identifying the base name of a file or the ID
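
A hypothetical invocation using the new reader option (the adc-convert entry point and the from-wav/to-wav plugin names are assumptions; --resume_from and its glob usage come from the list above):

    adc-convert \
        from-wav -i "./recordings/*.wav" --resume_from "*/rec_0042.wav" \
        to-wav -o ./out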

spectral-data-converter release

The first release of our spectral-data-converter-all library is now available: 0.0.1. Docker images have been deployed as well.

This library allows you to define and run processing pipelines on the command-line, e.g., for:

  • converting data from one format into another (e.g., OPUS to NIR)

  • cleaning the data (e.g., IQR)

  • transforming the data (e.g., SIMPLS, PLS1, standardize)

  • building and applying scikit-learn models

You can find examples for various scenarios here:

data-mining.co.nz/spectral-data-converter-examples/
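
As a minimal, hypothetical sketch of such a pipeline (the sdc-convert entry point and the from-opus/to-nir plugin names are assumptions based on the OPUS-to-NIR example above; the linked examples show actual usage):

    sdc-convert \
        from-opus -i "./spectra/*.0" \
        to-nir -o ./converted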

BitNet Docker image available

A first Docker image is available for Microsoft's BitNet small language model (SLM).

Below is an example of how to use this image (on Linux or on Windows under WSL2).

Prerequisites:

  • create a directory for your models and output, e.g., "bitnet"

  • in that directory, create the following sub-directories:

    • cache

    • triton

    • models

    • logs
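
On Linux, for example, this layout can be created in one go:

    mkdir -p bitnet/{cache,triton,models,logs}
    cd bitnet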

Interacting with the language model:

  • from the "bitnet" directory launch the docker image in interactive mode:

    docker run --shm-size 8G --net=host \
        -u $(id -u):$(id -g) -e USER=$USER \
        -v `pwd`:/workspace \
        -v `pwd`/cache:/.cache \
        -v `pwd`/triton:/.triton \
        -it waikatodatamining/bitnet:2025-05-30_cpu
  • as a one-off, download the BitNet-b1.58-2B-4T model from within the Docker container:

    huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf \
        --local-dir /workspace/models/BitNet-b1.58-2B-4T
  • once the model is in place, you can interact with it:

    bitnet_run_inference \
        -m /workspace/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
        -p "You are a helpful assistant" \
        -n 1024 \
        -cnv

S3000 REST webservice support

While our commercial framework for laboratories, S3000, has long supported making predictions via webservices, this support was limited to asynchronous ones: a webservice endpoint receives the incoming data and, once the predictions have been generated, forwards the results to another webservice.

With recent changes to the codebase, it is now possible to offer synchronous REST webservices as well. To reduce latency as much as possible, the provenance logging under the hood has been reworked for much higher throughput, so that it no longer impacts the speed of the predictions.

Thanks to the plugin architecture of S3000, customer-specific webservices can be implemented and deployed with minimal effort.
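
Purely as an illustration of the synchronous mode (the endpoint, payload and field names below are made up, since actual S3000 webservices are customer-specific): the client submits its data and receives the predictions in the same HTTP round-trip:

    curl -X POST https://s3000.example.com/api/predict \
        -H "Content-Type: application/json" \
        -d '{"sample": "S12345", "spectrum": [0.12, 0.15, 0.13]}'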

image-dataset-converter release

A new release of our image-dataset-converter-all library is now available: 0.0.11. Docker images have been deployed as well.

The most notable changes since 0.0.7 are:

  • support for placeholders is now available for readers/writers and can be used when constructing input/output files/folders; this covers predefined placeholders ({CWD}, {HOME}, {TMP}), input-based ones (e.g., {INPUT_PATH}, {INPUT_NAMEEXT}), user-defined ones (supplied to tools, e.g., via the -p/--placeholders option of the idc-convert tool) and run-time ones (set with the set-placeholder filter)

  • added the --resume_from option to applicable readers, which allows resuming the pipeline from the file matching the supplied glob, e.g., */012345.jpg

  • the new from-multi reader and to-multi writer simplify combining datasets (from potentially different formats) and outputting in multiple formats, respectively

  • writers that can split the incoming stream into subsets gained the new --split_group option, which keeps samples together within subsets via a regular expression, e.g., when dealing with images that were split into sub-grids or augmented with flipping/rotating
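
A hedged sketch of placeholders in action (the -p/--placeholders option and the placeholder names appear above; the from-coco-od/to-coco-od plugins and their -i/-o options are assumptions):

    idc-convert -p OUT=/data/converted \
        from-coco-od -i "{HOME}/datasets/coco/annotations.json" \
        to-coco-od -o "{OUT}/coco"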

SpeciesNet 4.0.1 Docker images available

First Docker images are available for the SpeciesNet network that Google announced on March 3rd, 2025.

Below is an example of how to use these images (on Linux or on Windows under WSL2).

Prerequisites:

  • create a directory for your output, e.g., "speciesnet"

  • in that directory, create the following sub-directories:

    • cache

    • config

    • data

    • output
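
On Linux, for example:

    mkdir -p speciesnet/{cache,config,data,output}
    cd speciesnet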

Processing data:

  • copy the images that you want to analyze into the "speciesnet/data" directory

  • from the "speciesnet" directory launch the appropriate docker image in interactive mode

    • CPU:

      docker run --rm --shm-size 8G --net=host \
        -u $(id -u):$(id -g) -e USER=$USER \
        -v `pwd`:/workspace \
        -v `pwd`/cache:/.cache \
        -v `pwd`/config:/.config \
        -v `pwd`/cache:/.torch \
        -it waikatodatamining/speciesnet:4.0.1_cpu
    • CUDA:

      docker run --rm --gpus=all --shm-size 8G --net=host \
        -u $(id -u):$(id -g) -e USER=$USER \
        -v `pwd`:/workspace \
        -v `pwd`/cache:/.cache \
        -v `pwd`/config:/.config \
        -v `pwd`/cache:/.torch \
        -it waikatodatamining/speciesnet:4.0.1_cuda12.1
  • run the following command to process your images:

    speciesnet_run_model \
        --folders "/workspace/data" \
        --predictions_json "/workspace/output/predictions.json"

Or, if you want to run the individual steps separately:

speciesnet_run_model --detector_only \
    --folders "/workspace/data" \
    --predictions_json "/workspace/output/detections.json"
speciesnet_run_model --classifier_only \
    --folders "/workspace/data" \
    --detections_json "/workspace/output/detections.json" \
    --predictions_json "/workspace/output/classifications.json"
speciesnet_run_model --ensemble_only \
    --folders "/workspace/data" \
    --detections_json "/workspace/output/detections.json" \
    --classifications_json "/workspace/output/classifications.json" \
    --predictions_json "/workspace/output/predictions.json"

On your host system, the "speciesnet/output" directory will then contain the generated .json file(s), with "predictions.json" containing all the relevant information (classification and bbox).
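
For a quick look at the results, something like the following jq call should work, assuming the top-level "predictions" array with per-image "prediction" and "prediction_score" fields described in the SpeciesNet documentation:

    jq '.predictions[] | {filepath, prediction, prediction_score}' output/predictions.json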

For more information on the JSON output format, see:

https://github.com/google/cameratrapai/tree/main?tab=readme-ov-file#output-format