image-dataset-converter release

A new release of our image-dataset-converter-all library is now available: 0.0.12. Docker images have been deployed as well.

The most notable changes since 0.0.11 are:

  • dropped numpy<2.0.0 restriction

  • added grayscale-to-binary filter

  • fix: sort-pixels, rgb-to-grayscale filters

  • the rename filter now supports lower/upper case placeholders for name and extension as well

  • now requires seppl>=0.2.17 for skippable plugin support and to avoid deprecated use of pkg_resources

  • added any-to-rgb filter for turning binary/grayscale images back into RGB ones

  • added label-to-metadata filter for transferring labels into meta-data

  • added metadata-to-placeholder filter for transferring meta-data into placeholders

  • added basic support for images with associated depth information: DepthData, DepthInformation

  • added depth-to-grayscale filter for converting depth information to a grayscale image

  • added depth information readers from-grayscale-dp, from-numpy-dp, from-csv-dp and from-pfm-dp

  • added depth information writers to-grayscale-dp, to-numpy-dp, to-csv-dp and to-pfm-dp

  • added apply-ext-mask filter for applying external PNG masks to image containers (image and/or annotations)

  • added apply-label-mask filter for applying image segmentation label masks to their base images

  • added label-present-ic and label-present-is filters that ensure certain label(s) are present, discarding the image otherwise

  • the label-present filter was renamed to label-present-od, with label-present kept as an alias for the time being

  • fix: imgseg_to_bluechannel, imgseg_to_indexedpng and imgseg_to_grayscale now handle overlapping pixels correctly, no longer adding them up and introducing additional labels

  • the discard-by-name filter can now also use the names of files in specified paths

  • fixed the construction of the error messages in the pyfunc reader/filter/writer classes
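
As a quick sketch of how the new depth plugins fit into a pipeline, the following converts grayscale-encoded depth maps to PFM files; the plugin names are taken from the changelog above, but the -i/-o options and glob syntax are assumptions:

    idc-convert \
        from-grayscale-dp -i "./depth_png/*.png" \
        to-pfm-dp -o ./depth_pfm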

llm-dataset-converter release

Version 0.2.7 of our llm_dataset_converter library has been released. New releases of ldc_doc, ldc_docx, ldc_faster_whisper, ldc_google, ldc_openai, ldc_pdf and ldc_tint have been made available as well.

The meta-library that combines all the libraries now stands at version 0.0.6:

llm-dataset-converter-all

A new Docker image is available as well:

https://hub.docker.com/r/waikatodatamining/llm-dataset-converter/tags

This release is mostly a maintenance release, but it still includes some useful additions:

  • added set-placeholder filter for dynamically setting (temporary) placeholders at runtime

  • added remove-strings filter that just removes sub-strings

  • added strip-strings filter for stripping whitespaces from start/end of strings
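
As a rough sketch of chaining the new filters on the command-line (the llm-convert entry point and the from-txt/to-txt plugins with their -i/-o options are assumptions; only the filter names come from the list above):

    llm-convert \
        from-txt -i "./raw/*.txt" \
        strip-strings \
        to-txt -o ./clean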

audio-dataset-converter release

A new release of our audio-dataset-converter library and its various dependent libraries is out.

The meta-library that combines all the libraries now stands at version 0.0.3:

audio-dataset-converter-all

A new Docker image is available as well:

https://hub.docker.com/r/waikatodatamining/audio-dataset-converter/tags

Notable changes:

  • improved support for placeholders via the set-placeholder and metadata-to-placeholder filters

  • added from-multi and to-multi for combining multiple readers/writers

  • added the --resume_from option to readers to allow resuming the processing from a specific file

  • added the --split_group option to writers: a regular expression with a single group used for keeping items in the same split, e.g., for identifying the base name of a file or the ID
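
A hypothetical invocation using the new reader option (the adc-convert entry point and the from-wav/to-wav plugin names are assumptions; --resume_from and its glob usage come from the list above):

    adc-convert \
        from-wav -i "./recordings/*.wav" --resume_from "*/rec_0042.wav" \
        to-wav -o ./out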

spectral-data-converter release

The first release of our spectral-data-converter-all library is now available: 0.0.1. Docker images have been deployed as well.

This library allows you to define and run processing pipelines on the command-line, e.g., for:

  • converting data from one format into another (e.g., OPUS to NIR)

  • cleaning the data (e.g., IQR)

  • transforming the data (e.g., SIMPLS, PLS1, standardize)

  • building and applying scikit-learn models

You can find examples for various scenarios here:

data-mining.co.nz/spectral-data-converter-examples/
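
As a minimal, hypothetical sketch of such a pipeline (the sdc-convert entry point and the from-opus/to-nir plugin names are assumptions based on the OPUS-to-NIR example above; the linked examples show actual usage):

    sdc-convert \
        from-opus -i "./spectra/*.0" \
        to-nir -o ./converted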

BitNet Docker image available

A first Docker image is available for Microsoft's BitNet small language model (SLM).

Below is an example of how to use this image (on Linux or on Windows under WSL2).

Prerequisites:

  • create a directory for your models and output, e.g., "bitnet"

  • in that directory, create the following sub-directories:

    • cache

    • triton

    • models

    • logs
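
On Linux, for example, this layout can be created in one go:

    mkdir -p bitnet/{cache,triton,models,logs}
    cd bitnet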

Interacting with the language model:

  • from the "bitnet" directory launch the docker image in interactive mode:

    docker run --shm-size 8G --net=host \
        -u $(id -u):$(id -g) -e USER=$USER \
        -v `pwd`:/workspace \
        -v `pwd`/cache:/.cache \
        -v `pwd`/triton:/.triton \
        -it waikatodatamining/bitnet:2025-05-30_cpu
  • as a one-off, download the BitNet-b1.58-2B-4T model from within the Docker container:

    huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf \
        --local-dir /workspace/models/BitNet-b1.58-2B-4T
  • once the model is in place, you can interact with it:

    bitnet_run_inference \
        -m /workspace/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
        -p "You are a helpful assistant" \
        -n 1024 \
        -cnv

S3000 REST webservice support

While our commercial framework for laboratories, S3000, has long supported making predictions via webservices, this support was limited to asynchronous ones: a webservice endpoint receives the incoming data and, once the predictions have been generated, forwards the results to another webservice.

With recent changes to the codebase, it is now possible to offer synchronous REST webservices as well. To reduce latency as much as possible, the provenance logging under the hood has been reworked for much higher throughput, so that it no longer impacts the speed of the predictions.

Thanks to the plugin architecture of S3000, customer-specific webservices can be implemented and deployed with minimal effort.
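
Purely as an illustration of the synchronous mode (the endpoint, payload and field names below are made up, since actual S3000 webservices are customer-specific): the client submits its data and receives the predictions in the same HTTP round-trip:

    curl -X POST https://s3000.example.com/api/predict \
        -H "Content-Type: application/json" \
        -d '{"sample": "S12345", "spectrum": [0.12, 0.15, 0.13]}'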

image-dataset-converter release

A new release of our image-dataset-converter-all library is now available: 0.0.11. Docker images have been deployed as well.

The most notable changes since 0.0.7 are:

  • support for placeholders is now available for readers/writers and can be used when constructing input/output files/folders; this covers predefined placeholders ({CWD}, {HOME}, {TMP}), input-based ones (e.g., {INPUT_PATH}, {INPUT_NAMEEXT}), user-defined ones (supplied to tools, e.g., via the -p/--placeholders option of the idc-convert tool) and run-time ones (set with the set-placeholder filter)

  • added the --resume_from option to applicable readers, which allows resuming the pipeline from the file matching the supplied glob, e.g., */012345.jpg

  • the new from-multi reader and to-multi writer simplify combining datasets (from potentially different formats) and outputting in multiple formats, respectively

  • writers that can split the incoming stream into subsets gained the new --split_group option, which keeps samples together within subsets via a regular expression, e.g., when dealing with images that were split into sub-grids or augmented with flipping/rotating
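
A hedged sketch of placeholders in action (the -p/--placeholders option and the placeholder names appear above; the from-coco-od/to-coco-od plugins and their -i/-o options are assumptions):

    idc-convert -p OUT=/data/converted \
        from-coco-od -i "{HOME}/datasets/coco/annotations.json" \
        to-coco-od -o "{OUT}/coco"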

SpeciesNet 4.0.1 Docker images available

First Docker images are available for the SpeciesNet network that Google announced on March 3rd, 2025.

Below is an example of how to use these images (on Linux or on Windows under WSL2).

Prerequisites:

  • create a directory for your output, e.g., "speciesnet"

  • in that directory, create the following sub-directories:

    • cache

    • config

    • data

    • output
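
On Linux, for example:

    mkdir -p speciesnet/{cache,config,data,output}
    cd speciesnet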

Processing data:

  • copy the images that you want to analyze into the "speciesnet/data" directory

  • from the "speciesnet" directory launch the appropriate docker image in interactive mode

    • CPU:

      docker run --rm --shm-size 8G --net=host \
        -u $(id -u):$(id -g) -e USER=$USER \
        -v `pwd`:/workspace \
        -v `pwd`/cache:/.cache \
        -v `pwd`/config:/.config \
        -v `pwd`/cache:/.torch \
        -it waikatodatamining/speciesnet:4.0.1_cpu
    • CUDA:

      docker run --rm --gpus=all --shm-size 8G --net=host \
        -u $(id -u):$(id -g) -e USER=$USER \
        -v `pwd`:/workspace \
        -v `pwd`/cache:/.cache \
        -v `pwd`/config:/.config \
        -v `pwd`/cache:/.torch \
        -it waikatodatamining/speciesnet:4.0.1_cuda12.1
  • run the following command to process your images:

    speciesnet_run_model \
        --folders "/workspace/data" \
        --predictions_json "/workspace/output/predictions.json"

Or, if you want to run the individual steps separately:

speciesnet_run_model --detector_only \
    --folders "/workspace/data" \
    --predictions_json "/workspace/output/detections.json"
speciesnet_run_model --classifier_only \
    --folders "/workspace/data" \
    --detections_json "/workspace/output/detections.json" \
    --predictions_json "/workspace/output/classifications.json"
speciesnet_run_model --ensemble_only \
    --folders "/workspace/data" \
    --detections_json "/workspace/output/detections.json" \
    --classifications_json "/workspace/output/classifications.json" \
    --predictions_json "/workspace/output/predictions.json"

On your host system, the "speciesnet/output" directory will then contain the generated .json file(s), with "predictions.json" containing all the relevant information (classification and bbox).
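
For a quick look at the results, something like the following jq call should work, assuming the top-level "predictions" array with per-image "prediction" and "prediction_score" fields described in the SpeciesNet documentation:

    jq '.predictions[] | {filepath, prediction, prediction_score}' output/predictions.json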

For more information on the JSON output format, see:

https://github.com/google/cameratrapai/tree/main?tab=readme-ov-file#output-format