Unstructured text and data are like gold for business applications and the company bottom line, but where to start? Here are three tools worth a look.

Developers and data scientists use generative AI and large language models (LLMs) to query volumes of documents and unstructured data. Open source LLMs, including Dolly 2.0, EleutherAI Pythia, Meta AI LLaMa, StabilityLM, and others, are all starting points for experimenting with artificial intelligence that accepts natural language prompts and generates summarized responses.

“Text as a source of knowledge and information is fundamental, yet there aren’t any end-to-end solutions that tame the complexity in handling text,” says Brian Platz, CEO and co-founder of Fluree. “While most organizations have wrangled structured or semi-structured data into a centralized data platform, unstructured data remains forgotten and underleveraged.”

If your organization and team aren’t experimenting with natural language processing (NLP) capabilities, you’re probably lagging behind competitors in your industry. In the 2023 Expert NLP Survey Report, 77% of organizations said they planned to increase spending on NLP, and 54% said their time-to-production was a top return-on-investment (ROI) metric for successful NLP projects.

Use cases for NLP

If you have a corpus of unstructured data and text, some of the most common business needs include

  • Entity extraction by identifying names, dates, places, and products
  • Pattern recognition to discover currency and other quantities
  • Categorization into business terms, topics, and taxonomies
  • Sentiment analysis, including positivity, negation, and sarcasm
  • Summarizing the document’s key points
  • Machine translation into other languages
  • Dependency graphs that translate text into machine-readable semi-structured representations

Sometimes, having NLP capabilities bundled into a platform or application is desirable. For example, LLMs support asking questions; AI search engines enable searches and recommendations; and chatbots support interactions. Other times, it’s optimal to use NLP tools to extract information and enrich unstructured documents and text.

Let’s look at three popular open source NLP tools that developers and data scientists are using to perform discovery on unstructured documents and develop production-ready NLP processing engines.

Natural Language Toolkit

The Natural Language Toolkit (NLTK), released in 2001, is one of the older and more popular NLP Python libraries. NLTK boasts more than 11,800 stars on GitHub and lists over 100 trained models.

“I think the most important tool for NLP is by far Natural Language Toolkit, which is licensed under Apache 2.0,” says Steven Devoe, director of data and analytics at SPR. “In all data science projects, the processing and cleaning of the data to be used by algorithms is a huge proportion of the time and effort, which is particularly true with natural language processing. NLTK accelerates a lot of that work, such as stemming, lemmatization, tagging, removing stop words, and embedding word vectors across multiple written languages to make the text more easily interpreted by the algorithms.”

NLTK’s benefits stem from its longevity: there are many examples and tutorials for developers new to NLP, from beginner’s hands-on guides to more comprehensive overviews. Anyone learning NLP techniques may want to try this library first, as it provides simple ways to experiment with basic techniques such as tokenization, stemming, and chunking.
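
To illustrate those basics, here is a minimal sketch (not taken from NLTK’s documentation) that tokenizes a sentence, removes stop words, stems, and applies part-of-speech tags. The sample sentence and the downloaded resources are assumptions, and resource names can vary slightly between NLTK versions.

```python
# A minimal NLTK sketch, assuming the sample sentence below and that the
# required corpora are downloadable; resource names can differ slightly
# between NLTK versions.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "The quick brown fox jumped over the lazy dogs near the riverbank."

tokens = nltk.word_tokenize(text)                     # tokenization
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop_words]
stems = [PorterStemmer().stem(t) for t in filtered]   # stemming
tagged = nltk.pos_tag(filtered)                       # part-of-speech tagging

print(stems)
print(tagged)
```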

spaCy

spaCy is a newer library, with its version 1.0 released in 2016. spaCy supports over 72 languages and publishes its performance benchmarks, and it has amassed more than 25,000 stars on GitHub.

“spaCy is a free, open-source Python library providing advanced capabilities to conduct natural language processing on large volumes of text at high speed,” says Nikolay Manchev, head of data science, EMEA, at Domino Data Lab. “With spaCy, a user can build models and production applications that underpin document analysis, chatbot capabilities, and all other forms of text analysis. Today, the spaCy framework is one of Python’s most popular natural language libraries for industry use cases such as extracting keywords, entities, and knowledge from text.”

Tutorials for spaCy show similar capabilities to NLTK, including named entity recognition and part-of-speech (POS) tagging. One advantage is that spaCy returns document objects and supports word vectors, which can give developers more flexibility for performing additional post-NLP data processing and text analytics.
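
As a rough sketch of those capabilities, the snippet below loads spaCy’s small English model and prints named entities and POS tags from the returned Doc object. The sample text is an assumption, and the model must be downloaded separately.

```python
# A minimal spaCy sketch: named entity recognition and POS tagging on a Doc.
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for ent in doc.ents:          # named entities with labels
    print(ent.text, ent.label_)

for token in doc[:6]:         # tokens with part-of-speech and dependency tags
    print(token.text, token.pos_, token.dep_)
```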

Spark NLP

If you already use Apache Spark and have its infrastructure configured, then Spark NLP may be one of the faster paths to begin experimenting with natural language processing. Spark NLP has several installation options, including AWS, Azure Databricks, and Docker.

“Spark NLP is a widely used open-source natural language processing library that enables businesses to extract information and answers from free-text documents with state-of-the-art accuracy,” says David Talby, CTO of John Snow Labs. “This enables everything from extracting relevant health information that only exists in clinical notes, to identifying hate speech or fake news on social media, to summarizing legal agreements and financial news.”

Spark NLP’s differentiators may be its healthcare, finance, and legal domain language models. These commercial products come with pre-trained models that identify drug names and dosages in healthcare, recognize financial entities such as stock tickers, and build legal knowledge graphs of company names and officers.

Talby says Spark NLP can help organizations minimize the upfront training in developing models. “The free and open source library comes with more than 11,000 pre-trained models plus the ability to reuse, train, tune, and scale them easily,” he says.
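
A hedged sketch of that reuse, assuming pyspark and the spark-nlp package are installed: the commonly documented “explain_document_dl” pre-trained pipeline annotates raw text and returns entities, lemmas, and other annotations. The sample sentence and output keys shown here are assumptions worth checking against the current Spark NLP documentation.

```python
# A hedged Spark NLP sketch; assumes pyspark and spark-nlp are installed and
# that the "explain_document_dl" pre-trained pipeline is still published under
# that name (check the Spark NLP models hub).
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()   # starts a Spark session configured for Spark NLP
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate("John Snow Labs maintains thousands of pre-trained NLP models.")
print(result.get("entities"))   # named entities found in the text
print(result.get("lemma"))      # lemmatized tokens
```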

Best practices for experimenting with NLP

Earlier in my career, I had the opportunity to oversee the development of several SaaS products built using NLP capabilities. My first was a SaaS platform for searching newspaper classified advertisements, including cars, jobs, and real estate. I then led the development of NLP engines for extracting information from commercial construction documents, including building specifications and blueprints.

When starting NLP in a new area, I advise the following:

  • Begin with a small but representative example of the documents or text.
  • Identify the target end-user personas and how extracted information improves their workflows.
  • Specify the required information extractions and target accuracy metrics.
  • Test several approaches and use speed and accuracy metrics to benchmark (see the benchmarking sketch after this list).
  • Improve accuracy iteratively, especially when increasing the scale and breadth of documents.
  • Expect to deliver data stewardship tools for addressing data quality and handling exceptions.
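
Here is a hypothetical benchmarking sketch for the testing step above: it scores any entity-extraction function against a small labeled sample using precision, recall, and wall-clock time. The labeled-sample format, extractor signature, and metric choices are illustrative assumptions, not a standard harness.

```python
# A hypothetical benchmarking harness; the labeled-sample format, extractor
# signature, and metric choices are illustrative assumptions.
import time

def evaluate(extractor, labeled_docs):
    """labeled_docs: list of (text, expected_entities) pairs, expected_entities a set."""
    tp = fp = fn = 0
    start = time.perf_counter()
    for text, expected in labeled_docs:
        predicted = set(extractor(text))      # extractor returns entity strings
        tp += len(predicted & expected)
        fp += len(predicted - expected)
        fn += len(expected - predicted)
    elapsed = time.perf_counter() - start
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall, "seconds": elapsed}

# Usage: compare evaluate(nltk_extractor, sample) with evaluate(spacy_extractor, sample).
```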

You may find that the NLP tools used to discover and experiment with new document types will aid in defining requirements. Then, expand the review of NLP technologies to include open source and commercial options, as building and supporting production-ready NLP data pipelines can get expensive. With LLMs in the news and gaining interest, underinvesting in NLP capabilities is one way to fall behind competitors. Fortunately, you can start with one of the open source tools introduced here and build your NLP data pipeline to fit your budget and requirements.

Isaac Sacolick is president of StarCIO and the author of the Amazon bestseller Driving Digital: The Leader’s Guide to Business Transformation through Technology and Digital Trailblazer: Essential Lessons to Jumpstart Transformation and Accelerate Your Technology Leadership. He covers agile planning, devops, data science, product management, and other digital transformation best practices. Sacolick is a recognized top social CIO and digital transformation influencer. He has published more than 900 articles at InfoWorld.com, CIO.com, his blog Social, Agile, and Transformation, and other sites.

Sourced from InfoWorld

Behind the pay-as-you-go pricing model, the public cloud is teeming with the latest and greatest development, devops, and AI tools for building better and smarter applications faster.

When we think of the public cloud, often the first consideration that comes to mind is financial: Moving workloads from near-capacity data centres to the cloud reduces capital expenditures (CapEx) but increases operating expenditures (OpEx). That may or may not be attractive to the CFO, but it isn’t exactly catnip for developers, operations, or those who combine the two as devops.

For these people, cloud computing offers many opportunities that simply aren’t available when new software services require the purchase of new server hardware or enterprise software suites. What takes six months to deploy on-premises can sometimes take 10 minutes in the cloud. What requires signatures from three levels of management to create on-prem can be charged to a credit card in the cloud.

It’s not just a matter of time and convenience. The cloud also enables higher velocity for software development, which often leads to faster time to market. The cloud can also allow for more experimentation, which often leads to higher software quality.

In addition, there are real innovations in the cloud that can provide immediate benefits and solve long-standing problems with on-premises computing. Here we present 16 compelling cloud capabilities.

Compute instances on demand

Need a new database on its own on-premises server? Get in line, and prepare to wait for months if not years. If you can tolerate having an on-prem virtual machine (VM) instead of a physical server and your company uses VMware or similar technologies, your wait might only take weeks. But if you want to create a server instance on a public cloud, you can have it provisioned and running in about 15 minutes – and you’ll be able to size it to your needs, and turn it off when you’re not using it.
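
As a hedged sketch of provisioning an instance programmatically, the snippet below launches a small server with AWS EC2 via boto3; the AMI ID, region, and instance type are placeholder assumptions, and the other public clouds offer equivalent SDK or CLI calls.

```python
# A hedged sketch of launching an instance with AWS EC2 via boto3; the AMI ID,
# region, and instance type below are placeholder assumptions.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder image ID
    InstanceType="t3.micro",           # size the instance to your needs
    MinCount=1,
    MaxCount=1,
)
print(instances[0].id)

# Turn it off when you're not using it:
# instances[0].stop()
```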

Pre-built virtual machine images

Being able to bring up a VM with the operating system of your choice is convenient, but then you still need to install and license the applications you need. Being able to bring up a VM with the operating system and applications of your choice all ready to run is priceless.

Serverless services

“Serverless” means that a service or piece of code will run on demand for a short time, usually in response to an event, without needing a dedicated VM on which to run. If a service is serverless, then you typically don’t need to worry about the underlying server at all; resources are allocated out of a pool maintained by the cloud provider.

Serverless services, currently available on every major public cloud, typically feature automatic scaling, built-in high availability, and a pay-for-value billing model. If you want a serverless app without being locked into any specific public cloud, you could use a vendor-neutral serverless framework such as Kubeless, which only requires a Kubernetes cluster (which is available as a cloud service; see below).
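
A minimal sketch of the serverless model, using an AWS Lambda-style Python handler as one example: the function runs in response to an event with no dedicated VM to manage. The event shape shown is an assumption; Azure Functions and Google Cloud Functions use similar entry points.

```python
# A minimal serverless-style handler; the event shape (an API Gateway-like
# request with a JSON body) is an assumption.
import json

def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}"}),
    }
```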

Containers on demand

A container is a lightweight executable unit of software, much lighter than a VM. A container packages application code and its dependencies, such as libraries. Containers share the host machine’s operating system kernel. Containers can run on Docker Engine or on a Kubernetes service. Running containers on demand has all the advantages of running VMs on demand, with the additional advantages of requiring fewer resources and costing less.
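
As a short sketch, the Docker SDK for Python can start a container on demand; this assumes Docker Engine is running and the docker package is installed, and the nginx image and port mapping are arbitrary examples.

```python
# A short sketch using the Docker SDK for Python (pip install docker); assumes
# Docker Engine is running, and the nginx image and port mapping are examples.
import docker

client = docker.from_env()

# Run nginx in the background, mapping host port 8080 to the container's port 80.
container = client.containers.run("nginx:latest", detach=True, ports={"80/tcp": 8080})
print(container.short_id)

# Stop and remove the container when finished.
container.stop()
container.remove()
```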

Pre-built container images

A Docker container is an executable instance of a Docker image, which is specified by a Dockerfile. A Dockerfile contains the instructions for building an image, and is often based on another image. For example, an image containing Apache HTTP Server might be based on an Ubuntu image. You can find thousands of pre-built images in the Docker Hub registry, and you can also write your own Dockerfiles to build custom images. You can run Docker images in your local installation of Docker, or in any cloud with container support. As with pre-built virtual machine images, a Dockerfile can bring up a full application quickly, but unlike VM images, Dockerfiles and the images they produce are vendor-agnostic.
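
A brief sketch of working with pre-built images, again using the Docker SDK for Python as an assumed tool: pull a ready-made Apache HTTP Server image, or build your own image from a local Dockerfile.

```python
# A brief sketch of pulling a pre-built image with the Docker SDK for Python;
# the httpd tag is an example, and the build() call assumes a Dockerfile in the
# current directory.
import docker

client = docker.from_env()

image = client.images.pull("httpd", tag="2.4")   # pre-built Apache HTTP Server image
print(image.tags)

# Or build a custom image from your own Dockerfile:
# image, build_logs = client.images.build(path=".", tag="my-app:latest")
```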

Kubernetes container orchestration

Kubernetes (K8s) is an open source system for automating deployment, scaling, and management of containerized applications. K8s was based on Google’s internal “Borg” technology. K8s clusters consist of a set of worker machines, called nodes, that run containerized applications. Worker nodes host pods, which contain applications; a control plane manages the worker nodes and pods. K8s runs anywhere and scales without bounds. All major public clouds have K8s services; you can also run K8s on your own development machine.
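
A hedged sketch using the official Kubernetes Python client to ask the control plane which pods its worker nodes are running; it assumes a kubeconfig from a managed K8s service (or a local cluster) and the kubernetes package.

```python
# A hedged sketch with the official Kubernetes Python client; assumes a valid
# kubeconfig (for example, from a managed K8s service) and the kubernetes package.
from kubernetes import client, config

config.load_kube_config()     # reads the local ~/.kube/config
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```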

Auto-scaling servers

You don’t have to containerize your applications and run them under Kubernetes to automatically scale them in the cloud. Most public clouds allow you to automatically scale virtual machines and services up (or down) as driven by usage, either by adding (or subtracting) instances or increasing (or decreasing) the instance size.
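
As one hedged example of usage-driven scaling, the sketch below attaches a target-tracking policy to an EC2 Auto Scaling group with boto3, so instances are added or removed to hold average CPU near a target. The group name, policy name, and 60% target are placeholder assumptions.

```python
# A hedged sketch of a target-tracking scaling policy for an EC2 Auto Scaling
# group via boto3; the group name, policy name, and 60% CPU target are
# placeholder assumptions.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",            # assumed existing group
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,                   # add/remove instances to hold ~60% CPU
    },
)
```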

Planetary databases

The major public clouds and several database vendors have implemented planet-scale distributed databases with underpinnings such as data fabrics, redundant interconnects, and distributed consensus algorithms that enable them to work efficiently and with up to five 9’s reliability (99.999% uptime). Cloud-specific examples include Google Cloud Spanner (relational), Azure Cosmos DB (multi-model), Amazon DynamoDB (key-value and document), and Amazon Aurora (relational). Vendor examples include CockroachDB (relational), PlanetScale (relational), Fauna (relational/serverless), Neo4j (graph), MongoDB Atlas (document), DataStax Astra (wide-column), and Couchbase Cloud (document).
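
A small sketch of using one of those databases, Amazon DynamoDB, through boto3; the table name, key schema, and item attributes are placeholder assumptions, and the table is assumed to already exist.

```python
# A small sketch against Amazon DynamoDB via boto3; the table name, partition
# key ("pk"), and item attributes are placeholder assumptions, and the table is
# assumed to already exist.
import boto3

table = boto3.resource("dynamodb", region_name="us-east-1").Table("orders")

table.put_item(Item={"pk": "order#1001", "status": "shipped", "total": 42})
item = table.get_item(Key={"pk": "order#1001"}).get("Item")
print(item)
```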

Hybrid services

Companies with large investments in data centres often want to extend their existing applications and services into the cloud rather than replace them with cloud services. All the major cloud vendors now offer ways to accomplish that, both by using specific hybrid services (for example, databases that can span data centres and clouds) and on-premises servers and edge cloud resources that connect to the public cloud, often called hybrid clouds.

Scalable machine learning training and prediction

Machine learning training, especially deep learning, often requires substantial compute resources for hours to weeks. Machine learning prediction, on the other hand, needs its compute resources for seconds per prediction, unless you’re doing batch predictions. Using cloud resources is often the most convenient way to accomplish model training and predictions.

Cloud GPUs, TPUs, and FPGAs

Deep learning with large models and the very large datasets needed for accurate training can often take much more than a week on clusters of CPUs. GPUs, TPUs, and FPGAs can all cut training time down significantly, and having them available in the cloud makes it easy to use them when needed.
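
A minimal PyTorch sketch of how a cloud GPU is picked up when present; the tiny model and random batch are placeholders, and the same code falls back to CPU on machines without a GPU.

```python
# A minimal PyTorch sketch: use a GPU when the cloud instance has one, and fall
# back to CPU otherwise. The model and batch are placeholders.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print("running on", device)

model = torch.nn.Linear(128, 10).to(device)     # move the model to the device
batch = torch.randn(64, 128, device=device)     # allocate a batch on the same device
output = model(batch)
print(output.shape)
```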

Pre-trained AI services

Many AI services can be performed well by pre-trained models, for example language translation, text to speech, and image identification. All the major cloud services offer pre-trained AI services based on robust models.
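
As a hedged example of calling a pre-trained cloud AI service, the sketch below uses Amazon Translate via boto3; the region and sample text are assumptions, and the other major clouds expose comparable translation, speech, and vision APIs through their own SDKs.

```python
# A hedged sketch calling Amazon Translate through boto3; the region and sample
# text are assumptions, and AWS credentials must be configured.
import boto3

translate = boto3.client("translate", region_name="us-east-1")

result = translate.translate_text(
    Text="Pre-trained models make many AI tasks a simple API call.",
    SourceLanguageCode="en",
    TargetLanguageCode="es",
)
print(result["TranslatedText"])
```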

Customizable AI services

Sometimes pre-trained AI services don’t do exactly what you need. Transfer learning, which trains only a few neural network layers on top of an existing model, can give you a customized service relatively quickly compared to training a model from scratch. Again, all the major cloud service providers offer transfer learning, although they don’t all call it by the same name.
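
A hedged transfer-learning sketch in PyTorch, separate from any one cloud's managed offering: freeze a pre-trained ResNet-18 and train only a new classification head. The class count and learning rate are illustrative assumptions, and the weights enum requires a recent torchvision.

```python
# A hedged transfer-learning sketch in PyTorch: freeze a pre-trained ResNet-18
# and train only a new classification head. The class count and learning rate
# are illustrative assumptions; the weights enum requires torchvision >= 0.13.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for param in model.parameters():              # freeze the pre-trained layers
    param.requires_grad = False

num_classes = 5                               # assumed number of target classes
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)   # new trainable head

# Only the new head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```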

Monitoring services

All clouds support at least one monitoring service and make it easy for you to configure your cloud services for monitoring. The monitoring services often show you a graphical dashboard, and can be configured to notify you of exceptions and unusual performance indicators.
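
As one hedged example, the sketch below publishes a custom metric and an alarm to Amazon CloudWatch with boto3; the namespace, metric name, threshold, and alarm name are placeholder assumptions, and other clouds' monitoring services offer similar APIs.

```python
# A hedged sketch publishing a custom metric and alarm to Amazon CloudWatch via
# boto3; the namespace, metric name, and threshold are placeholder assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[{"MetricName": "QueueDepth", "Value": 42, "Unit": "Count"}],
)

cloudwatch.put_metric_alarm(
    AlarmName="queue-depth-high",
    Namespace="MyApp",
    MetricName="QueueDepth",
    Statistic="Average",
    Period=300,                  # seconds
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
)
```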

Distributed services

Databases aren’t the only services that can benefit from running in a distributed fashion. The issue is latency. If compute resources are far from the data or from the processes under management, it takes too long to send and receive instructions and information. If latency is too high in a feedback loop, the loop can easily go out of control. If latency is too high between machine learning and the data, the time it takes to perform the training can blow up. To solve this problem, cloud service providers offer connected appliances that can extend their services to a customer’s data centres (hybrid cloud) or near a customer’s factory floors (edge computing).

Edge computing

The need to bring analysis and machine learning geographically close to machinery and other real-world objects (the Internet of Things, or IoT) has led to specialized devices, such as miniature compute devices with GPUs and sensors, and architectures to support them, such as edge servers, automation platforms, and content delivery networks. Ultimately, these all connect back to the cloud, but the ability to perform analysis at the edge can greatly decrease the volume of data sent to the cloud as well as reducing the latency.

The next time you hear grief about your cloud spending, perhaps you can point to one of these 16 benefits – or to one of the cloud features that have helped you or your team. Any one of the cloud innovations we’ve discussed can justify its use. Taken together, the benefits really are irresistible.

Sourced from InfoWorld