Constructing a Semantic E-book Search: Scale an Embedding Pipeline with Apache Spark and AWS EMR Serverless | by Eva Revear

Machine Learning

Constructing a Semantic E-book Search: Scale an Embedding Pipeline with Apache Spark and AWS EMR Serverless | by Eva Revear | Jan, 2024

hhhhm

2024年2月20日

Constructing a Semantic E-book Search: Scale an Embedding Pipeline with Apache Spark and AWS EMR Serverless | by Eva Revear | Jan, 2024

[ad_1]

Utilizing OpenAI’s Clip mannequin to help pure language search on a set of 70k e book covers

In a earlier submit I did a little bit PoC to see if I might use OpenAI’s Clip mannequin to construct a semantic e book search. It labored surprisingly properly, in my view, however I couldn’t assist questioning if it will be higher with extra knowledge. The earlier model used solely about 3.5k books, however there are tens of millions within the Openlibrary knowledge set, and I assumed it was worthwhile to attempt including extra choices to the search area.

Nonetheless, the total dataset is about 40GB, and attempting to deal with that a lot knowledge on my little laptop computer, and even in a Colab pocket book was a bit a lot, so I had to determine a pipeline that would handle filtering and embedding a bigger knowledge set.

TLDR; Did it enhance the search? I feel it did! We 15x’ed the info, which provides the search rather more to work with. Its not good, however I assumed the outcomes had been pretty fascinating; though I haven’t finished a proper accuracy measure.

This was one instance I couldn’t get to work regardless of how I phrased it within the final iteration, however works pretty properly within the model with extra knowledge.

In case you’re curious you may attempt it out in Colab!

Total, it was an fascinating technical journey, with a number of roadblocks and studying alternatives alongside the best way. The tech stack nonetheless consists of the OpenAI Clip mannequin, however this time I leverage Apache Spark and AWS EMR to run the embedding pipeline.

This appeared like a superb alternative to make use of Spark, because it permits us to parallelize the embedding computation.

I made a decision to run the pipeline in EMR Serverless, which is a reasonably new AWS providing that gives a serverless setting for EMR and manages scaling assets robotically. I felt it will work properly for this use case — versus spinning up an EMR on EC2 cluster — as a result of this can be a pretty ad-hoc challenge, I’m paranoid about cluster prices, and initially I used to be uncertain about what assets the job would require. EMR Serverless makes it fairly simple to experiment with job parameters.

Under is the total course of I went by way of to get all the things up and operating. I think about there are higher methods to handle sure steps, that is simply what ended up working for me, so if in case you have ideas or opinions, please do share!

Constructing an embedding pipeline job with Spark

The preliminary step was writing the Spark job(s). The complete pipeline is damaged out into two levels, the primary takes within the preliminary knowledge set and filters for latest fiction (inside the final 10 years). This resulted in about 250k books, and round 70k with cowl photos out there to obtain and embed within the second stage.

First we pull out the related columns from the uncooked knowledge file.