llm-dataset-converter release

Version 0.2.4 of our llm-dataset-converter library is now available.

This is only a minor release, mainly fixing batch processing and adding default globs for readers. Support for default globs means that the user only has to supply the directory; in a bash shell it is therefore no longer necessary to double-quote the input to avoid shell expansion. Support for default globs was also added to additional libraries where appropriate.
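The mechanism behind default globs can be sketched as follows: if the supplied input is a directory, the reader appends its default pattern before globbing. This is a simplified illustration with a hypothetical default of *.txt, not the library's actual code:

```python
import glob
import os

def locate_files(inp: str, default_glob: str = "*.txt") -> list:
    """Expand the input to a list of files; if a directory is supplied,
    append the default glob first (mirroring the readers' behaviour)."""
    if os.path.isdir(inp):
        inp = os.path.join(inp, default_glob)
    return sorted(glob.glob(inp))
```

With this in place, `locate_files("data")` and `locate_files("data/*.txt")` yield the same result, so the user can simply pass the directory.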

The llm-dataset-converter-all meta-library now stands at version 0.0.2.

image-dataset-converter release

Our image-dataset-converter library keeps evolving and, apart from fixing bugs, we also keep adding useful stuff.

The image-dataset-converter-all meta-library now stands at version 0.0.3.

Since version 0.0.2 the following changes occurred:

  • image-dataset-converter (core library):

    • switched to the fast-opex library

    • helper method from_indexedpng was using incorrect label index (off by 1)

    • Data.save_image method now ensures that source/target files exist before calling os.path.samefile

    • requiring seppl>=0.2.6 now

    • readers now support default globs, allowing the user to just specify directories as input (and the default glob gets appended)

    • the to-yolo-od writer now has an option for predefined labels (for enforcing label order)

the to-yolo-od writer now stores the labels/labels_csv files in the respective output folders rather than using an absolute file name

    • the bluechannel/grayscale/indexed-png image segmentation readers/writers can use a value other than 0 now for the background

    • split filter has been renamed to split-records

  • image-dataset-converter-imgaug: added find-contours filter for turning blobs in image segmentation annotations into object detection polygons

  • image-dataset-converter-imgvis: added add-center-overlay-od overlay filter

  • image-dataset-converter-pdf (new module): adds support for PDF, like extracting images from PDF and compiling PDF from images
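The os.path.samefile change listed above guards against a genuine pitfall: samefile raises FileNotFoundError when either path is missing. A minimal sketch of such a guarded comparison (not the library's actual code):

```python
import os

def same_file_safe(src: str, dst: str) -> bool:
    """Return True only when both files exist and resolve to the same
    underlying file; calling os.path.samefile directly would raise
    FileNotFoundError if either path is missing."""
    return (os.path.exists(src)
            and os.path.exists(dst)
            and os.path.samefile(src, dst))
```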

fast-opex released

The OPEX (Object Predictions EXchange) format features heavily in our docker images for storing/broadcasting predictions. However, last week I noticed that it incurs quite a significant speed penalty due to its use of JSON schema under the hood. Since we want to be as fast as possible at prediction time, I sat down and rewrote the library using very basic (but fast) checks and released it under the name fast-opex. The new library works as a drop-in replacement, i.e., you only have to switch from installing opex to fast-opex.
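To give a flavour of the kind of "basic but fast" check that can stand in for JSON-schema validation, here is a hand-rolled sketch; the field names are chosen for illustration and this is not fast-opex's actual code:

```python
def validate_bbox(d: dict) -> None:
    """Cheap structural check for a bounding-box dict, replacing a
    JSON-schema validation step with plain key/type tests."""
    for key in ("left", "top", "right", "bottom"):
        if key not in d:
            raise ValueError("missing field: %s" % key)
        if not isinstance(d[key], int):
            raise TypeError("field %s must be int, got %s" % (key, type(d[key])))
```

Such checks do far less work than a schema validator, which is where the speed-up at prediction time comes from.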

To further speed things up, the new library can take advantage of the blazingly fast orjson JSON library. The orjson library only needs to be present in the environment and it will be used automatically.
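A common way to implement such optional acceleration is a guarded import with a standard-library fallback; a sketch of the pattern (not necessarily fast-opex's exact code):

```python
import json

# Use orjson when it is installed, otherwise fall back to the stdlib
# json module; callers see the same dumps/loads interface either way.
try:
    import orjson

    def dumps(obj) -> str:
        # orjson returns bytes, so decode for a str-based interface
        return orjson.dumps(obj).decode("utf-8")

    def loads(s):
        return orjson.loads(s)
except ImportError:
    def dumps(obj) -> str:
        return json.dumps(obj)

    def loads(s):
        return json.loads(s)
```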

If you are interested in a speed comparison, then head over to the following repository:

https://github.com/waikato-datamining/opex-comparison

Faster Whisper 1.0.2 (speech-to-text)

New Docker images are now available for speech-to-text using Faster Whisper 1.0.2:

https://github.com/waikato-llm/whisper/tree/main/faster-whisper-1.0.2_cuda12.1

https://github.com/waikato-llm/whisper/tree/main/faster-whisper-1.0.2_cpu

Faster Whisper is a reimplementation of OpenAI's Whisper library with some dramatic speed ups.

With the release of these images, the Coqui STT images have been retired (just like the Coqui STT project itself).

image-dataset-converter release

Based on lessons learned from our wai-annotations library, we simplified and streamlined the design of a data processing library (though limited to just image datasets). Of course, it makes use of the latest seppl version, which also simplifies how plugins are located at runtime and development time.

The new kid on the block is called image-dataset-converter and its code is located here:

https://github.com/waikato-datamining/image-dataset-converter

Whilst it is based on wai-annotations, it already contains additional functionality.

And, of course, we also have resources demonstrating how to use the new library:

https://www.data-mining.co.nz/image-dataset-converter-examples/

XTuner Docker images available

Docker images for XTuner 0.1.18 are now available:

  • In-house registry:

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-xtuner:0.1.18_cuda11.7

  • Docker hub:

    • waikatodatamining/pytorch-xtuner:0.1.18_cuda11.7

XTuner 0.1.18 now supports the just released llama-3 models (e.g., Meta-Llama-3-8B-Instruct).

XTuner Docker images available

XTuner is an efficient, flexible and full-featured toolkit for fine-tuning large models (InternLM, Llama, Baichuan, Qwen, ChatGLM) and is released under the Apache 2.0 license. The advantage of this framework is that it is not tied to a specific LLM architecture, but supports multiple ones out of the box. With the just released version 0.2.0 of our llm-dataset-converter Python library, you can read and write the XTuner JSON format (and apply the usual filtering, of course).
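For illustration, a single-turn record in XTuner's conversation-style JSON layout could be written like this (field names based on XTuner's documented dataset format; treat the sketch as illustrative rather than as llm-dataset-converter's output):

```python
import json

# One single-turn conversation record in the XTuner JSON layout
records = [
    {"conversation": [
        {"system": "You are a helpful assistant.",
         "input": "What is the capital of New Zealand?",
         "output": "Wellington."}
    ]}
]

with open("xtuner-sample.json", "w") as f:
    json.dump(records, f, indent=2)
```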

Here are the newly added image tags:

  • In-house registry:

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-xtuner:2024-02-19_cuda11.7

  • Docker hub:

    • waikatodatamining/pytorch-xtuner:2024-02-19_cuda11.7

Of course, you can use these Docker images in conjunction with our gifr Python library for gradio interfaces (gifr-textgen). We just released version 0.0.4 of the library, which is more flexible with regard to text generation: it can now send and receive the conversation history and also parse JSON responses.
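Carrying the conversation history across requests can be pictured as a thin client-side loop that sends prior turns along with each new prompt. A minimal sketch with a hypothetical generate callable (not gifr's actual API):

```python
# Accumulated (user, assistant) turns sent along with each new prompt
history = []

def chat(user_msg, generate):
    """Send the prompt plus prior turns to the backend, record the
    exchange in the history and return the reply."""
    reply = generate({"prompt": user_msg, "history": list(history)})
    history.append((user_msg, reply))
    return reply
```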

Text classification support

Large language models (LLMs) for chatbots are all the rage at the moment, but there is plenty of scope for simpler tasks like text classification, which require fewer resources and run a lot faster.

We turned the HuggingFace example for sequence classification into a docker image to make it easy to build such classification models.

  • In-house registry:

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-huggingface-transformers:4.36.0_cuda11.7_classification

  • Docker hub:

    • waikatodatamining/pytorch-huggingface-transformers:4.36.0_cuda11.7_classification

Our gifr Python library for gradio received an interface for text classification (gifr-textclass) in version 0.0.3.

The llm-dataset-converter library obtained native support for text classification formats with version 0.1.1.