Best Approach to Clientside Machine Learning for Text Classification

Question

I have approximately 100k rows of text data (initially PDF documents that have been OCR). Most are rows of less than 5000 characters. Each of the source documents are addressed to some department. These are typically in the form of the below examples where the target department would 'Urology' (there are several departments).

Urologly Department
Urologly Clinic
Urology Out Patients
Urology
Dear urology team

I have read a bit on ML Text Analysis and it seems I should be able to make a pretty good model by reviewing several hundred documents for each department (I have built an App to help me do this) and manually Classifying those documents. Some documents may mention urology but are actually addressed to another department. Typically the addressed department text is at the top third (first 3-7 lines) of the text body.

I cannot use any online tools, i.e. I can't upload any of the Document text to servers to process I need a client side library. I have read and completed several tutorials using the ML.net but these are pretty basic (sentiment, entity detection without any initial training), and read an excellent blog at MonkeyLearn: which seems to acknowledge that can do what I imagine I should be able to do.

So can anybody point me in the right direction, can I use some offline Microsoft client library to complete my task? Is there some other Open Source client library i should look at. Will I have to learn Go, or python to complete the task (currently a C# dev).

Note: I could get fairly good matches simply using SQL Text search and a bit of C# with plenty of hard coded rules, but I thought I'd try ML -- however its a nest of complications at the moment and i am going around in circles.

Many Thanks
Mike.

Accepted Answer

@Mike Shapleski I see two possible solutions for your scenario.

Extracting text from your documents using the computer vision API and passing the required text as input to Azure Text Analytics for Health API
Using Azure cognitive search to upload the documents and creating a search service and enabling specific skills on the service to extract PII data or entities

The first solution can help you achieve this and ensure everything is offline or using docker containers without uploading any of your data to any storage externally. For billing purposes the containers need to connect to a metering endpoint on Azure to bill your usage of both these services(Computer Vision API & Azure text analytics containers). Also, you can use C# client library to call the local endpoint of these containers. The setup could take time to configure docker containers and passing the PDF documents to the computer vision read API to extract text. The extracted text can then be directly used or stored, to call the text analytics for health API.

The second solution can be used to index all the documents by using the search service by having your data in the cloud or behind a firewall to index the documents and make them searchable. There are some skills that can be enabled on the search service to extract entities and other PII information but this may not extract the same data as text analytics for health. This solution can be faster to setup because you can directly query your data after uploading the documents.

If an answer is helpful, please click on or upvote which might help other community members reading this thread.

Share via

Best Approach to Clientside Machine Learning for Text Classification

0 additional answers

Your answer