Metadata-Version: 2.1
Name: flashlight-text
Version: 0.0.1
Summary: Flashlight Text bindings for Python
Home-page: https://github.com/flashlight/text
Author: Jacob Kahn
Author-email: jacobkahn1@gmail.com
License: BSD licensed, as found in the LICENSE file
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE

# Flashlight Text Python Bindings
### Quickstart

Flashlight Text is available on PyPI **without KenLM support** using:
```bash
pip install flashlight-text  # without KenLM support
```
For now, building from source is required for KenLM support. We'll be adding KenLM support to the PyPI package soon.

#### Contents
- [Installation](#installation)
  * [Dependencies](#dependencies)
  * [Build Instructions](#build-instructions)
  * [Advanced Options](#advanced-options)
- [Python API Documentation](#python-api-documentation)
  * [Beam search decoder](#beam-search-decoder)
  * [Beam search decoding with your own language model](#decoding-with-your-own-language-model)
- [Tests and Examples](#tests-and-examples)

## Installation
### Dependencies
We require `python >= 3.6` with the following packages installed:
- [cmake](https://cmake.org/) >= 3.16, and `make`
- [KenLM](https://github.com/kpu/kenlm)

### Build Instructions

Once the dependencies are satisfied, from the project root, use:
```bash
pip install .
```

Using the environment variable `USE_KENLM=0` removes the KenLM dependency but precludes using the decoder with a language model unless you write C++/`pybind11` bindings for your own language model.
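For example, to install from source without KenLM:
```bash
USE_KENLM=0 pip install .
```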

Install in editable mode for development:
```bash
pip install -e .
```

(A PyPI package with KenLM support is coming soon; see the [Quickstart](#quickstart) above.)

**Note:** if you encounter errors, you'll probably have to `rm -rf build dist` before retrying the install.

## Python API Documentation

### Beam Search Decoder
Bindings for the lexicon and lexicon-free beam search decoders are supported for CTC/ASG models only (no seq2seq model support). Out-of-the-box language model support includes KenLM; users can also define a custom language model in Python and use it for decoding; see the [documentation](#decoding-with-your-own-language-model) below.

To run the decoder, first define its options:
```python
from flashlight.lib.text.decoder import LexiconDecoderOptions, LexiconFreeDecoderOptions

# for the lexicon-based decoder
options = LexiconDecoderOptions(
    beam_size, # number of top hypotheses to preserve at each decoding step
    token_beam_size, # restrict the tokens considered to those with the top AM scores (useful for huge token sets)
    beam_threshold, # preserve a hypothesis only if its score is within this threshold of the current best hypothesis score
    lm_weight, # language model weight for the LM score
    word_score, # score added for each word appearing in the transcription
    unk_score, # score added for each unknown word appearing in the transcription
    sil_score, # score added for each silence appearing in the transcription
    log_add, # how to combine scores when merging hypotheses (log-add or max)
    criterion_type # CriterionType.ASG or CriterionType.CTC only
)
# for the lexicon-free decoder
options = LexiconFreeDecoderOptions(
    beam_size, # number of top hypotheses to preserve at each decoding step
    token_beam_size, # restrict the tokens considered to those with the top AM scores (useful for huge token sets)
    beam_threshold, # preserve a hypothesis only if its score is within this threshold of the current best hypothesis score
    lm_weight, # language model weight for the LM score
    sil_score, # score added for each silence appearing in the transcription
    log_add, # how to combine scores when merging hypotheses (log-add or max)
    criterion_type # CriterionType.ASG or CriterionType.CTC only
)
```
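For concreteness, here is a minimal sketch of constructing lexicon-based options with placeholder values (the numbers are illustrative, not tuned settings, and arguments are passed positionally in the order listed above):

```python
import math
from flashlight.lib.text.decoder import CriterionType, LexiconDecoderOptions

# placeholder values; tune them for your model and data
options = LexiconDecoderOptions(
    100,               # beam_size
    50,                # token_beam_size
    25.0,              # beam_threshold
    2.0,               # lm_weight
    1.0,               # word_score
    -math.inf,         # unk_score
    0.0,               # sil_score
    False,             # log_add: merge hypotheses with max instead of log-add
    CriterionType.CTC, # criterion_type
)
```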

Now, prepare a tokens dictionary (the tokens for which the acoustic model returns per-frame probabilities) and a lexicon (a mapping from words to their spellings within the token set).

For further details on tokens and lexicon file formats, see the [Data Preparation](https://github.com/flashlight/flashlight/tree/master/flashlight/app/asr#data-preparation) documentation in [Flashlight](https://github.com/flashlight/flashlight).
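For orientation only (the linked documentation is authoritative): the tokens file lists one token per line, and each lexicon line maps a word to its space-separated spelling over those tokens. A hypothetical `lexicon.txt` excerpt:

```
hello h e l l o |
world w o r l d |
```

with `tokens.txt` containing entries like `|`, `a`, `b`, ..., one per line.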

```python
from flashlight.lib.text.dictionary import Dictionary, load_words, create_word_dict

token_dict = Dictionary("path/tokens.txt")
# for ASG, add the repetition symbols used, for example:
# token_dict.add_entry("1")
# token_dict.add_entry("2")

lexicon = load_words("path/lexicon.txt") # returns LexiconMap
word_dict = create_word_dict(lexicon) # returns Dictionary
```

To create a KenLM language model, use:

```python
from flashlight.lib.text.decoder import KenLM
lm = KenLM("path/lm.arpa", word_dict) # or "path/lm.bin"
```

Get the unknown and silence token indices from the token and word dictionaries to pass to the decoder:

```python
sil_idx = token_dict.get_index("|")
unk_idx = word_dict.get_index("<unk>")
```

Now, define the lexicon `Trie` to restrict the beam search decoder's search space:

```python
from flashlight.lib.text.decoder import Trie, SmearingMode
from flashlight.lib.text.dictionary import pack_replabels

trie = Trie(token_dict.index_size(), sil_idx)
start_state = lm.start(False)

def tkn_to_idx(spelling: list, token_dict: Dictionary, max_reps: int = 0):
    result = [token_dict.get_index(token) for token in spelling]
    return pack_replabels(result, token_dict, max_reps)


for word, spellings in lexicon.items():
    usr_idx = word_dict.get_index(word)
    _, score = lm.score(start_state, usr_idx)
    for spelling in spellings:
        # convert the spelling string into a list of token indices
        spelling_idxs = tkn_to_idx(spelling, token_dict, 1)
        trie.insert(spelling_idxs, usr_idx, score)

trie.smear(SmearingMode.MAX) # propagate word scores up the trie so each node carries an LM proxy score
```

Finally, we can run the lexicon-based decoder:

```python
import numpy
from flashlight.lib.text.decoder import LexiconDecoder


blank_idx = token_dict.get_index("#") # for CTC
transitions = numpy.zeros((token_dict.index_size(), token_dict.index_size())) # for ASG, fill with the trained transition scores
is_token_lm = False # we use a word-level LM
decoder = LexiconDecoder(options, trie, lm, sil_idx, blank_idx, unk_idx, transitions, is_token_lm)
# emissions is a numpy array (float32) of model predictions with shape (T, N),
# where T is the number of frames and N is the number of tokens
T, N = emissions.shape
results = decoder.decode(emissions.ctypes.data, T, N)
# results[i].tokens contains the token sequence (of length T)
# results[i].score contains the score of the hypothesis
# results is sorted, with the best hypothesis at index 0
```
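As a sketch of turning the best hypothesis back into text, assuming the CTC token set used above (`#` as the blank token, `|` as the word boundary) and that `Dictionary.get_entry` maps an index back to its token string:

```python
import itertools

best = results[0]
# collapse repeated frame labels, then drop CTC blanks
collapsed = [idx for idx, _ in itertools.groupby(best.tokens)]
letters = [token_dict.get_entry(idx) for idx in collapsed if idx != blank_idx]
transcription = "".join(letters).replace("|", " ").strip()
print(best.score, transcription)
```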

### Decoding with your own language model
One can define a custom language model in Python and use it for beam search decoding.

To store language model state, derive from the `LMState` base class and attach additional data to each state by keeping a `dict` mapping `LMState` to that data inside the language model class:

```python
import numpy
from flashlight.lib.text.decoder import LM, LMState


class MyPyLM(LM):
    mapping_states = dict() # store a simple additional int for each state

    def __init__(self):
        LM.__init__(self)

    def start(self, start_with_nothing):
        state = LMState()
        self.mapping_states[state] = 0
        return state

    def score(self, state: LMState, token_index: int):
        """
        Evaluate the language model for the current LM state and a new word.

        Parameters:
        -----------
        state: current LM state
        token_index: index of the word
                    (this may be a lexicon index, in which case the LM should
                    store a mapping between lexicon and LM word indices, or it
                    may be an LM word index directly)

        Returns:
        --------
        (LMState, float): pair of (new state, score for the current word)
        """
        outstate = state.child(token_index)
        if outstate not in self.mapping_states:
            self.mapping_states[outstate] = self.mapping_states[state] + 1
        return (outstate, -numpy.random.random())

    def finish(self, state: LMState):
        """
        Evaluate eos for language model based on the current lm state

        Returns:
        --------
        (LMState, float): pair of (new state, score for the current word)
        """
        outstate = state.child(-1)
        if outstate not in self.mapping_states:
            self.mapping_states[outstate] = self.mapping_states[state] + 1
        return (outstate, -1)
```

`LMState` is a C++ base class for language model states. Its `compare` method (for comparing one state with another) is used inside the beam search decoder.
It also has a `LMState child(int index)` method, which returns the state obtained by following the token with the given index from the current state.

All LM states are organized as a trie. We use the `child` method in Python to build this trie correctly (the decoder uses it to compare states) and can store additional state data in `mapping_states`.
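As a quick illustration of this trie property (a sketch relying only on the behavior described above), following the same token index from the same state reaches the same node, which is exactly what the `mapping_states` lookups in the example rely on:

```python
from flashlight.lib.text.decoder import LMState

states = dict()
root = LMState()
states[root] = "root"

a = root.child(5)
states[a] = "after token 5"

# following the same edge again reaches the same trie node,
# so b is usable as the same dictionary key as a
b = root.child(5)
print(states[b])  # "after token 5"
```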

This language model can be used as follows. Here, we print each state and the additional info stored for it in `custom_lm.mapping_states`:

```python
custom_lm = MyPyLM()

state = custom_lm.start(True)
print(state, custom_lm.mapping_states[state])

for i in range(5):
    state, score = custom_lm.score(state, i)
    print(state, custom_lm.mapping_states[state], score)

state, score = custom_lm.finish(state)
print(state, custom_lm.mapping_states[state], score)
```

and for the decoder:

```python
decoder = LexiconDecoder(options, trie, custom_lm, sil_idx, blank_idx, unk_idx, transitions, False)
```

## Tests and Examples

An integration test for Python decoder bindings can be found in `bindings/python/test/test_decoder.py`. To run, use:
```bash
cd bindings/python/test
python3 -m unittest discover -v .
```
