marqo/ at mainline · marqo-ai/marqo · GitHub

2023-04-12 02:01:55

Speech is one of the most popular forms of communication, and much of the world's information is held in audio recordings of human speech, whether as videos, movies, TV, phone calls, meeting recordings or other media. Although this data is abundant, accessing the content of speech data is a difficult task, and making it searchable is even harder.

In this article we present a system which extracts audio from online sources, identifies speakers, transcribes their speech, indexes the transcriptions into Marqo and then uses the resulting Marqo index as a source of truth to respond to questions about the content. We demonstrate the behaviour of this system by ingesting a diverse dataset from three distinct domains and then asking very domain-specific questions to a chatbot that is provided with context retrieved with Marqo.


The challenges of working with Audio

Searching audio is a challenging problem. In the world of AI, audio is an especially challenging medium to work with due to its high dimensionality and its obfuscation of useful features when represented as a waveform in the time domain. The human ear can hear sounds up to around 20,000 Hz, which requires a sample rate of 40,000 Hz to represent. Given that the average English speaker in the US speaks at about 2.5 words per second, some back-of-the-napkin maths says that it takes an average of 16,000 floating point numbers to represent a single word at the fidelity required to match the human ear (in the time domain).

In practice this large dimensionality problem is made more palatable by down-sampling the audio: we accept a loss in quality in exchange for more manageable data. Down-sampling to a 16,000 Hz sample rate, using the numbers discussed before, gives us a more manageable (but still large) average of 6,400 floating point numbers per word.
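
The napkin maths above can be checked in a few lines. This is only a sketch of the arithmetic; the function name samples_per_word is illustrative and does not appear in the article's code.

```python
def samples_per_word(sample_rate_hz: float, words_per_second: float = 2.5) -> float:
    """Average number of samples needed to represent one spoken word."""
    return sample_rate_hz / words_per_second


# 40,000 Hz covers the ~20,000 Hz upper limit of human hearing
full_fidelity = samples_per_word(40_000)
# 16,000 Hz is the down-sampled rate discussed above
downsampled = samples_per_word(16_000)

print(f"{full_fidelity:.0f} samples per word at 40 kHz")
print(f"{downsampled:.0f} samples per word at 16 kHz")
```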

Many AI systems don't operate directly on the waveform and instead transform windows of the waveform into the frequency domain to generate a spectrogram. This involves converting chunks of the signal into the frequency domain with fast Fourier transforms, as depicted below.

Waveform in the time domain converted to a spectrogram
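
To make the windowing idea concrete, here is a minimal NumPy sketch of a spectrogram computed with a short-time Fourier transform. The window size, hop length and the spectrogram function itself are assumptions chosen for illustration; the models used later in this article ship their own feature extractors.

```python
import numpy as np


def spectrogram(signal: np.ndarray, window_size: int = 400, hop: int = 160) -> np.ndarray:
    """Power spectrogram: one row per window, one column per frequency bin."""
    window = np.hanning(window_size)
    n_frames = 1 + (len(signal) - window_size) // hop
    # slice the waveform into overlapping windows and taper each one
    frames = np.stack(
        [signal[i * hop : i * hop + window_size] * window for i in range(n_frames)]
    )
    # the real-input FFT turns each window into its frequency content
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


sample_rate = 16_000
window_size = 400
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1_000 * t)  # one second of a 1 kHz test tone

spec = spectrogram(tone, window_size=window_size)
# the strongest bin, averaged over time, should sit at the tone's frequency
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / window_size
print(spec.shape, peak_hz)
```

Each row of the result describes 25 ms of audio (400 samples at 16,000 Hz), which is far easier for a model to reason about than the raw samples.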

Applications for Searching Speech Data

Avid users of Marqo may have come across our podcast demo, which was an inspiration for this project.

There are many applications for searchable speech in commercial platforms, enterprise and industry.

For platforms that run podcasts or host videos, this approach could be used to enhance the retrieval of relevant search results and remove the reliance on metadata tagging. Search results can include a snippet of the media as well as time stamps, reducing the user's need to navigate through and scrub over the media looking for a specific section.

In an enterprise context this technology can be used to index recordings of meetings to provide organisations with a new capability: the retrieval, attribution and traceability of decision points. Speaker diarisation works much better when the number of speakers is known ahead of time; this could be extracted by integrating with meeting software to monitor concurrent speakers throughout the meeting. Speaker names from the meeting can also be mapped to the transcribed sections and indexed.

The traceability capability extends to industry applications. Indexing and archiving live communications data means that operational decisions can be retrieved later. This can be invaluable for a posteriori analysis of the decision making that led to an outcome, whether that be an incident or a success.

Audio and Marqo

Currently Marqo doesn't support audio and instead only has support for images and text.

As such, this article discusses the process of building a speech processing front end that wrangles audio data into documents which can be indexed by Marqo. The process can be broken into the following steps, which are discussed in more detail in subsequent sections:

  • Ingestion
  • Speaker diarisation & speech to text
  • Indexing
  • Searching and information retrieval

The full source code that is referred to in the article can be found here.


Ingestion

The downloader can ingest audio from a variety of sources, including YouTube videos, links to audio files and local files. The AudioWrangler class provides a number of methods for downloading and wrangling audio.

In the constructor, the output directory and temporary directory are stored as instance variables. The temporary directory (for intermediate processing) is created if it doesn't exist; this is done relative to the location of the Python script itself.

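import hashlib
import os
import shutil
import urllib.request
from multiprocessing import Pool
from typing import List

from pydub import AudioSegment
from yt_dlp import YoutubeDL

ABS_FILE_FOLDER = os.path.dirname(os.path.abspath(__file__))


class AudioWrangler:
    def __init__(self, output_path: str, clean_up: bool = True):
        self.output_path = output_path
        self.tmp_dir = "downloads"

        # create the temporary directory next to this script if it is missing
        os.makedirs(os.path.join(ABS_FILE_FOLDER, self.tmp_dir), exist_ok=True)

        if clean_up:
            ...  # the clean-up step is cut off in this excerpt

    def convert_to_wav(self, fpath: str) -> str:
        sound = AudioSegment.from_file(fpath)
        # swap only the final extension so dots elsewhere in the path survive
        wav_path = os.path.splitext(fpath)[0] + ".wav"
        sound.export(wav_path, format="wav")
        return wav_path

    def _move_to_output(self, file: str) -> str:
        target = os.path.join(self.output_path, os.path.basename(file))
        shutil.move(file, target)
        return target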

YouTube videos have their audio extracted as 192 kbps MP3 files using the yt_dlp library; Pydub is then used to convert this to WAV. Videos are given a unique name by taking the SHA256 hash of their URL.

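def download_from_youtube(self, url: str):
    # name the download after the SHA256 hash of its URL
    outf = os.path.join(
        ABS_FILE_FOLDER,
        self.tmp_dir,
        hashlib.sha256(url.encode("ascii")).hexdigest(),
    )
    ydl_opts = {
        "format": "bestaudio/best",
        "postprocessors": [
            {
                "key": "FFmpegExtractAudio",
                "preferredcodec": "mp3",
                "preferredquality": "192",
            }
        ],
        "fragment_retries": 10,
        "outtmpl": outf,
    }
    with YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])

    # the post-processor appends .mp3 to the output template
    outf = self.convert_to_wav(outf + ".mp3")
    outf = self._move_to_output(outf)
    return outf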
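
The naming scheme can be shown in isolation. The helper below is a hypothetical sketch (hashed_audio_path is not part of the article's code) and it encodes the URL as UTF-8 rather than ASCII, an assumed tweak so that URLs containing non-ASCII characters do not raise an error.

```python
import hashlib
import os


def hashed_audio_path(url: str, directory: str, extension: str = ".wav") -> str:
    """Build a stable, collision-resistant file path from a source URL."""
    # the same URL always hashes to the same 64 character hex digest
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(directory, digest + extension)


# example.com is a placeholder, not one of the article's sources
first = hashed_audio_path("https://example.com/episode-1.mp3", "downloads")
again = hashed_audio_path("https://example.com/episode-1.mp3", "downloads")
other = hashed_audio_path("https://example.com/episode-2.mp3", "downloads")
print(first == again, first == other)
```

Because the name depends only on the URL, downloading the same source twice overwrites the earlier file instead of creating a duplicate.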

URLs that point to audio files can also be downloaded, in this case using Python's urllib.request module.

def download_from_web(self, url: str):
    outf = os.path.join(
        ABS_FILE_FOLDER,
        self.tmp_dir,
        hashlib.sha256(url.encode("ascii")).hexdigest() + ".wav",
    )
    # some hosts reject requests that lack a browser-like User-Agent
    req = urllib.request.Request(url=url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as response, open(outf, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    outf = self.convert_to_wav(outf)
    outf = self._move_to_output(outf)
    return outf

To make downloading in bulk easier, you can provide a file where each line is a link to either a YouTube video or an audio file, and the program will download and convert all of them. There is also a multiprocessing version of this to download in parallel.

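def download_from_file(self, file):
    urls = []
    with open(file, "r") as f:
        for url in f.readlines():
            urls.append(url.strip())
    # download and convert each source in turn
    return [self.read_url_source(url) for url in urls if url]

def multiprocess_read_url_sources(self, sources: List[str]):
    # parallel variant: one worker per CPU core
    with Pool(os.cpu_count()) as pool:
        return pool.map(self.read_url_source, sources)

def read_url_source(self, source: str):
    # the substring tested here is blank in the source; "youtube.com" is an assumed stand-in
    if "youtube.com" in source:
        return self.download_from_youtube(source)

    return self.download_from_web(source)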
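
A slightly stricter way to route each line of a sources file is to compare the parsed host name instead of searching for a substring. The sketch below is an assumption about how that could look: the YOUTUBE_HOSTS set, classify_source and read_sources are hypothetical names, and the URLs are placeholders.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlparse

# assumed set of host names that should go through the YouTube downloader
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_source(source: str) -> str:
    """Return "youtube" for YouTube links and "web" for everything else."""
    host = urlparse(source.strip()).netloc.lower()
    return "youtube" if host in YOUTUBE_HOSTS else "web"


def read_sources(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair every non-empty line with the downloader that should handle it."""
    return [(line.strip(), classify_source(line)) for line in lines if line.strip()]


lines = [
    "https://www.youtube.com/watch?v=PLACEHOLDER\n",
    "\n",
    "https://example.com/talk.mp3\n",
    "https://youtu.be/PLACEHOLDER\n",
]
for url, kind in read_sources(lines):
    print(kind, url)
```

Matching on the host avoids misrouting a direct audio link that merely mentions YouTube somewhere in its path or query string.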

The Data

In the code for this article, three collections of audio sources are provided. These are stored as text files with a list of links (YouTube videos). The provided files include:

  • James Hoffmann (videos about coffee)
  • The WAN Show (tech podcast with two speakers)
  • Expert Panels (long videos with multiple speakers)

The expert panels file has two expert panels from OpenAI, one titled GANs for Good and one titled Optimizing BizOps with AI.

Speaker Diarisation and Speech to Text

The speaker diarisation and speech to text functions are collated together in the AudioTranscriber class.

The constructor takes in the Hugging Face token, device and batch size for transcription. These are stored as instance variables alongside the models and pipelines required for the audio transcription process.

class AudioTranscriber:
    def __init__(
        self, hf_token: str, device: str = "cpu", transcription_batch_size: int = 4
    ):
        self.device = device
        self.sample_rate = 16000

        self.transcription_batch_size = transcription_batch_size

        self._model_size = "medium"

        # the checkpoint arguments to the next two calls are cut off in this excerpt
        self.transcription_model = Speech2TextForConditionalGeneration.from_pretrained(
            ...
        )
        self.transcription_processor = Speech2TextProcessor.from_pretrained(
            ...
        )
        self.annotation_pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization@2.1", use_auth_token=hf_token
        )

Speaker diarisation is the process of identifying who is speaking at what time in an audio recording. The output of this is a list of segments with a start time, end time and speaker labels. Because the model has no way of knowing who a speaker is by name, these labels are of the form SPEAKER_00, SPEAKER_01, etc.

We can use the start and end times extracted by the diarisation process to segment the audio by speaker. To do this we use the pyannote.audio library with the speaker-diarization V2.1 model. To use this model yourself you will need to accept the terms and conditions and obtain an API token from Hugging Face; details can be found here.

The annotation process is captured in the annotate method, which takes in the path for an audio file and returns the speaker annotations. The heavy lifting is wrapped up behind the scenes in the annotation pipeline. Longer segments of speech are chunked into 30 second segments to avoid very long sections for single speaker audio.

def annotate(self, file: str) -> List[Tuple[float, float, Set[str]]]:
    diarization = self.annotation_pipeline(file)
    speaker_times = []
    for t in diarization.get_timeline():
        start, end = t.start, t.end
        # reduce to 30 second chunks in case of long segments
        while end - start > 0:
            speaker_times.append(
                (start, min(start + 30, end), diarization.get_labels(t))
            )
            start += 30

    return speaker_times
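
The chunking loop inside annotate can be pulled out and tried on its own. The following is a sketch with a hypothetical helper name, chunk_segment, that mirrors the loop without needing the diarisation pipeline.

```python
from typing import List, Tuple


def chunk_segment(start: float, end: float, max_len: float = 30.0) -> List[Tuple[float, float]]:
    """Split one speaker segment into pieces no longer than max_len seconds."""
    chunks = []
    while end - start > 0:
        # the final chunk is clipped to the end of the segment
        chunks.append((start, min(start + max_len, end)))
        start += max_len
    return chunks


# a 70 second monologue becomes two full chunks and a 10 second remainder
for chunk_start, chunk_end in chunk_segment(0, 70):
    print(chunk_start, chunk_end)
```

Segments that are already shorter than 30 seconds pass through as a single chunk.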

The audio enters the annotation pipeline as a WAV file in the time domain.

wav file

The annotation process transforms this into a diarised log of which speaker spoke at what time and for how long. Overlapping speakers are also identified and segmented.

speaker annotations

The speaker times extracted by the annotation can then be used to divide the original audio into subsections for speech to text processing.
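
Dividing the audio comes down to converting seconds into sample indices. Here is a minimal sketch, assuming the audio is any indexable sequence of samples; slice_by_times is a hypothetical name and the tiny sample rate is only there to keep the example readable.

```python
from typing import List, Sequence, Tuple


def slice_by_times(
    samples: Sequence[float], times: List[Tuple[float, float]], sample_rate: int
) -> List[Sequence[float]]:
    """Cut one sub-sequence of samples per (start, end) pair given in seconds."""
    return [
        samples[int(start * sample_rate) : int(end * sample_rate)]
        for start, end in times
    ]


sample_rate = 4  # toy rate: four samples per second
samples = list(range(12))  # three seconds of stand-in audio
segments = slice_by_times(samples, [(0.0, 1.0), (1.5, 3.0)], sample_rate)
print(segments)
```

The same slicing works unchanged on a NumPy array loaded at 16,000 Hz.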

For speech to text the fairseq S2T model family was used. As seen in the constructor for AudioTranscriber, the pretrained medium model from Hugging Face is used by default. Segments of speech are batched before being passed to the preprocessor and the model.

def transcribe(self, datas: List[np.ndarray], samplerate: int = 16000) -> List[str]:
    batches = []
    batch = []
    for data in datas:
        # pad short audio
        if data.shape[0] < 400:
            data = np.pad(data, [(0, 400)], mode="constant")
        batch.append(data)
        # start a new batch as soon as the current one is full
        if len(batch) >= self.transcription_batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)

    transcriptions = []
    for batch in tqdm(
        batches, desc=f"Processing with batch size {self.transcription_batch_size}"
    ):
        inputs = self.transcription_processor(
            batch, sampling_rate=samplerate, return_tensors="pt", padding=True
        )
        generated_ids = self.transcription_model.generate(
            inputs["input_features"], attention_mask=inputs["attention_mask"]
        )
        transcription_batch = self.transcription_processor.batch_decode(
            generated_ids, skip_special_tokens=True
        )

        transcriptions += transcription_batch

    return transcriptions
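
The batching step at the start of transcribe is a generic pattern that can be written as a small generator. This is a sketch only, and make_batches is a hypothetical helper that is not in the article's code.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield list(items[i : i + batch_size])


# ten audio segments with a batch size of four give batches of 4, 4 and 2
batches = list(make_batches(list(range(10)), batch_size=4))
print([len(b) for b in batches])
```

Slicing by a fixed stride guarantees that no batch exceeds the requested size, which matters when the batch size has been tuned to fit in GPU memory.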

The diarisation and transcription are wrapped up together into one driver method which brings together the speaker annotations and the transcriptions into documents to be indexed in the next step.

Each document has a speaker, start time, end time, transcription, samplerate and file.

def process_audio(self, file: str) -> List[Dict[str, Any]]:
    speaker_times = self.annotate(file)
    audio_data, samplerate = librosa.load(file, sr=self.sample_rate)

    # cut the waveform into one array per diarised segment
    datas = []
    for start, end, _ in speaker_times:
        datas.append(audio_data[int(start * samplerate) : int(end * samplerate)])

    transcriptions = self.transcribe(datas, samplerate)

    annotated_transcriptions = []
    for i in range(len(transcriptions)):
        annotated_transcriptions.append(
            {
                "_id": self._create_id(file, i),
                "speaker": [*speaker_times[i][2]],
                "start": speaker_times[i][0],
                "end": speaker_times[i][1],
                "transcription": transcriptions[i],
                "samplerate": samplerate,
                "file": file,
            }
        )

    return annotated_transcriptions
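
The shape of each document can be written down as a typed dictionary, which also makes it easy to render a segment with time stamps. The sketch below is illustrative: TranscriptSegment, format_timestamp and describe are hypothetical names, and the sample values are made up.

```python
from typing import List, TypedDict


class TranscriptSegment(TypedDict):
    _id: str
    speaker: List[str]
    start: float
    end: float
    transcription: str
    samplerate: int
    file: str


def format_timestamp(seconds: float) -> str:
    """Render a number of seconds as HH:MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def describe(segment: TranscriptSegment) -> str:
    """One line summary: time range, speaker labels and the transcribed text."""
    start = format_timestamp(segment["start"])
    end = format_timestamp(segment["end"])
    speakers = ", ".join(segment["speaker"])
    return f"[{start}-{end}] {speakers}: {segment['transcription']}"


example: TranscriptSegment = {
    "_id": "example-0",
    "speaker": ["SPEAKER_00"],
    "start": 65.0,
    "end": 95.0,
    "transcription": "hello there",
    "samplerate": 16000,
    "file": "example.wav",
}
print(describe(example))
```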


Indexing

The backbone of this whole application is the Marqo database, and indexing the annotated transcriptions is made simple by Marqo. As described above, each document has multiple fields; however, we only want to calculate tensor embeddings for the 'transcription' field. To do this we simply pass the names of the other fields through to the non_tensor_fields argument.

Some filtering is applied to remove short and erroneously transcribed segments.

def index_transciptions(
    annotated_transcriptions: List[Dict[str, Any]],
    index: str,
    mq: marqo.Client,
    non_tensor_fields: List[str] = [],
    device: str = "cpu",
    batch_size: int = 32,
) -> Dict[str, str]:

    # drop short transcriptions and transcriptions that consist of duplicated repeating
    # character artifacts; a segment must pass both checks to be kept
    annotated_transcriptions = [
        at
        for at in annotated_transcriptions
        if len(at["transcription"]) > 5 and len({*at["transcription"]}) > 4
    ]

    response = mq.index(index).add_documents(
        annotated_transcriptions,
        non_tensor_fields=non_tensor_fields,
        device=device,
        client_batch_size=batch_size,
    )

    return response
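
The filter is easy to test on its own. The sketch below uses a hypothetical helper, is_usable, with the same two thresholds; it requires a segment to pass both checks, which matches the stated intent of dropping short segments as well as repeated-character artifacts.

```python
def is_usable(transcription: str, min_chars: int = 5, min_unique: int = 4) -> bool:
    """Keep text that is long enough and made of enough distinct characters."""
    long_enough = len(transcription) > min_chars
    varied_enough = len(set(transcription)) > min_unique
    return long_enough and varied_enough


candidates = [
    "ok",  # too short
    "aaaaaaaaaaaaaaaa",  # long, but a single repeated character
    "the moka pot uses steam pressure",  # a normal sentence
]
kept = [text for text in candidates if is_usable(text)]
print(kept)
```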


Searching and Information Retrieval

We are now able to search through our indexed transcriptions and retrieve the original audio snippets back from the system. This is a powerful capability in itself; however, what makes a system like this work well in practice is a human-usable method of engagement. To enable this we can place a language model between the search results and the end user.

Generating Conversational Answers

Returning verbatim text and audio snippets is useful for some applications; however, many users simply want an easy to understand answer to their questions. To do this we can add a language model after the results from Marqo are returned to answer the question in natural language. This process can be seen as follows:

process flow

Our prompt is constructed as follows. The goal is to push the model to only answer questions that it knows the answer to and to generate accurate answers.

You are a question answerer, given the CONTEXT provided you will answer the QUESTION (also provided).
If you are not sure of the answer then say 'I am sorry but I do not know the answer'
Your answers should be two to five sentences in length and only contain information relevant to the question. You should match the tone of the CONTEXT.
The beginnings of the CONTEXT should be the most relevant so try and use that wherever possible, it is important that your answers are factual and don't make up information that is not in the CONTEXT.


This prompt template is used in the answer_question function, which searches Marqo, uses the transcriptions to make a context and then sends it to the text-davinci-003 model as part of the prompt.

def answer_question(
    query: str,
    limit: int,
    index: str,
    mq: marqo.Client,
) -> str:
    results = mq.index(index).search(
        q=query,
        limit=limit,
    )

    # hits arrive ordered by relevance, so the best matches lead the context
    context = ". ".join([r["transcription"] for r in results["hits"]])

    prompt = PromptTemplate(template=TEMPLATE, input_variables=["context", "question"])
    llm = OpenAI(temperature=0.9, model_name="text-davinci-003")
    chain_qa = LLMChain(llm=llm, prompt=prompt)
    llm_results = chain_qa(
        {"context": context, "question": query}, return_only_outputs=True
    )
    return llm_results["text"]
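
The prompt assembly can be seen without calling Marqo or a language model. The sketch below uses plain string formatting; EXAMPLE_TEMPLATE is a shortened, assumed stand-in for the article's TEMPLATE (whose final lines are not shown above), and build_prompt and the sample hits are hypothetical.

```python
from typing import Any, Dict, List

# assumed stand-in; only the {context} and {question} placeholders matter here
EXAMPLE_TEMPLATE = (
    "You are a question answerer, given the CONTEXT provided you will answer the QUESTION.\n"
    "CONTEXT:\n{context}\n"
    "QUESTION:\n{question}\n"
)


def build_prompt(hits: List[Dict[str, Any]], question: str) -> str:
    """Join the transcriptions of the search hits and slot them into the template."""
    context = ". ".join(hit["transcription"] for hit in hits)
    return EXAMPLE_TEMPLATE.format(context=context, question=question)


hits = [
    {"transcription": "the moka pot uses steam pressure"},
    {"transcription": "water is pushed up through the coffee"},
]
prompt = build_prompt(hits, "How does a moka pot work?")
print(prompt)
```

Keeping the hits in the order returned by the search is what lets the prompt tell the model that the beginning of the context is the most relevant.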

This function is then called in the main function for the chat component:

import marqo
from SpeechSearch.chatter import answer_question

def main():
    mq = marqo.Client(url="http://localhost:8882")

    index_name = "transcription-index"
    while True:
        query = input("Enter a query: ")
        answer = answer_question(
            query=query,
            limit=5,  # illustrative value; the arguments are cut off in this excerpt
            index=index_name,
            mq=mq,
        )
        print(answer)

if __name__ == "__main__":
    main()

The outputs are very impressive. Below are some examples where all three data sets have been indexed using Marqo. This is important to remember, as the database now has information on coffee, Generative Adversarial Networks, BizOps and technology news.

Here are some coffee related questions:

Q: How does the Bialetti Moka Express work?

A: The Bialetti Moka Express is a coffee brewing machine that uses steam pressure to push hot water up and over coffee grounds. It's made up of a base, a filter basket filled with coffee, an upper chamber, and a filter plate. Hot water is filled into the base and is then pushed up through the coffee grounds and filter plate, creating pressure in the coffee and forcing the coffee up into the upper chamber. This process ensures a consistent extraction of the coffee, creating a sweeter and more complex cup of coffee.

Q: What is the history of the Bialetti logo?

A: The Bialetti logo is a drawing of a man with a moustache who was apparently based on Renato Bialetti, the company's founder who took over the family business in the 1950s. The logo was created by a cartoonist hired by the Bialetti family to represent the brand. The logo is an iconic representation of the Italian coffee and has become synonymous with the Bialetti Company.

Q: When making espresso with a light roasted coffee, what grind setting should I use?

A: A light roasted coffee should be brewed with a fine grind setting in order to properly extract all the flavours. For best results, the grind should be finer than that used for filter coffee. As a starting point, 18g of coffee should take around 30 seconds to extract. However, exact grind settings can vary from person to person, so it is best to experiment, taste and adjust until you achieve the desired result.

Q: What pressure should the machine use to extract espresso?

A: The pressure of the machine is an important factor when it comes to extracting espresso. Ideally, the machine should be set to 9 bars of pressure in order to get a strong and concentrated espresso with a thick and sweet flavour. A lower pressure may lead to a slower shot and a longer contact time, but it won't necessarily increase the extraction. Brewing too fast can lead to channeling, which reduces extraction and causes a harsh and bitter taste.

And some related to the WAN Show podcast:

Q: What is the argument around right to repair?

A: The argument for right to repair is that people should have the freedom to repair their own devices as they please. It is based on the idea that you should have the right to make repairs yourself, or pay someone to do it for you, and that companies shouldn't make it overly expensive or difficult to repair devices you own. The goal is to ensure that companies cannot restrict people's ability to repair their own devices, and to make it easier for people to repair their own things if they choose. Right to repair advocates and crusaders have been pushing for people to have access to electronic replacement parts and the ability to fix their own phones.

Q: How does Samsung's space zoom work?

A: Samsung's space zoom works by using a combination of machine learning and post-processing features. It uses the camera to capture a low resolution image, and then uses machine learning to combine and add details from an algorithm that was trained on high resolution images of the moon. This allows the camera to add sharper details and gives the impression of a zoom feature. In addition, post-processing features such as edge detection and motion blur reduction are used to further improve the quality of the image.

Q: What is the controversy around Samsungs Space Zoom feature?

A: The controversy around Samsung's Space Zoom feature is that it implies that their phone cameras can take high definition pictures of the moon and other distant objects, while in reality the feature is only able to zoom in on distant objects with a low degree of clarity. This has raised concerns about the use of this technology for deep fake videos, as well as worries that Samsung may be purposely misrepresenting the capability of their phones.

And some results for the data in the expert panels:

Q: What does BizOps mean for AI?

A: BizOps stands for Business Operations, and it is an integral part of the AI industry. AI is used in many aspects of business operations, such as data collection and analysis, machine learning, supply chains, and marketing. AI is used to optimize efficiency and increase profitability, which means that BizOps can be improved using AI effectively. AI is also used for predictive analytics and creating automated workflows. BizOps also refers to the relationship between AI companies and their clients, as many AI businesses are focused on providing solutions that can improve the efficiency and accuracy of their client's operations.

Q: What are some applications of GANs that are helpful to the research community?

A: GANs have seen a lot of applications recently that are helpful to the research community. GANs have been used to generate realistic images, improve security, generate music and more. GANs are especially helpful to students or practitioners in research fields as they can greatly help with the problems they see in their day-to-day. GANs are also being used for good in areas such as differential privacy and combating bias. GANs are being used in the medical field as well, where they can generate fake medical data for use in research without compromising any patient's privacy. GANs are also being used for fairness and for creating data for underrepresented populations. GANs have been used to learn from unlabeled real images and make a refiner that upgrades the synthetic image to make it more realistic.


The weakest link in the system is definitely the speech to text model. In particular, its ability to deal with acronyms is lacking, and these can often cause some strange artifacts. More experimentation with different models, or fine-tuning models for domain-specific applications, would likely improve this.

Other newer models, such as OpenAI's Whisper models, do perform the speech to text processing better; however, they bring their own unique behaviours. For example, the Whisper models have a habit of putting quotation marks around lines for segments with overlapping speakers, which for the purposes of this application is not the desired behaviour.

Things can also go wrong for shorter queries, as the response can be very sensitive to the word choice, especially when similar language is used throughout the corpus. These queries can cause more irrelevant information to be returned, which in turn makes the chat model exhibit hallucinatory behaviour.

Closing Thoughts

Searching speech data is challenging, but when so much useful information is stored as audio waveforms it is important that we have systems that can leverage this information. With the right tools and techniques, it's possible to make speech data searchable and indexable by platforms like Marqo. This article presents only a very simple implementation and already achieves high quality retrieval of audio segments using text queries.

The full code and setup instructions can be found on Owen Elliott's GitHub.
