Anthony J. Pennings, PhD


How Do Artificial Intelligence and Big Data Use APIs and Web Scraping to Collect Data? Implications for Net Neutrality

Posted on January 19, 2024

One of the books I use in a course called EST 202 – Introduction to Science, Technology, and Society Studies is Michio Kaku’s Physics of the Future (2011). Despite its age, it’s a great starting point for teaching topics like Computers, Robotics, Nanotechnology, Space Travel, and Energy. It also has a chapter on Artificial Intelligence (AI) that I use with the caveat that it doesn’t include a major change in AI that was occurring around the time it was published: the growing importance of data networking for AI data collection and learning. High-speed broadband networks have become fundamental to new AI and also “Big Data” because the success of these services now depends on their ability to scour the Internet and other networked data sources to find useful information.[1]

[Image: web scraping]

This post looks at how collecting information from various structured and “unstructured” data sources has become an essential process for procuring information resources for AI and Big Data.[2] In particular, it looks at two strategies that are used to search networked sources for relevant data. It then discusses some ramifications for net neutrality, a regulatory stance that seeks to avoid discrimination against data content providers, including generative AI, by Internet Service Providers (ISPs).

Broadband communications enable the transfer of data between different applications on sensors, smart devices and cloud locations, contributing to the overall effectiveness of AI models and Big Data analytics. AI encompasses various technologies and approaches, including machine learning (ML), neural networks, natural language processing, expert systems, and robotics.[See 3] Big Data technologies include tools and frameworks designed to process, store, and analyze large datasets.

Technologies like MapReduce at Google and Hadoop at Yahoo! created the programming frameworks that led to applications like Apache Spark, NoSQL databases, and various data warehousing solutions. These are general-purpose cluster computing systems with programs written in Scala, Java, and Python that make parallel jobs easy to write and manage. These operating engines direct workloads, perform queries, conduct analyses, and support computation graphs at a totally new scale. They work across a wide range of low-cost servers, collecting information from mobile devices, PCs, and Internet of Things (IoT) devices such as autos, cash registers, and building environmental systems. Information from these data sources becomes fodder for analysis and innovative value creation.

APIs (Application Programming Interfaces) and web scraping collect information from the data networks, including the Internet. APIs are instrumental in integrating data into AI applications and machine learning models. APIs are also crucial in facilitating Big Data collection by providing a relatively standardized way for different software applications to communicate and exchange data. Web scraping is important to both AI and Big Data as the process of extracting information from HTML and CSS-coded websites collects large volumes of usable data.

What are the Differences between Big Data and AI?

While AI and Big Data are distinct concepts, they often intersect as AI systems frequently rely on large datasets for training and learning. Big Data technologies play a crucial role in managing the data requirements of AI applications, providing the necessary infrastructure for processing and analyzing vast amounts of information needed to build and continually train AI models.

The purpose of AI is to enable digital machines to perform tasks that mimic or simulate human intelligence. This includes areas such as natural language processing, computer vision, machine learning, and robotics. AI systems can be designed to perform specific tasks, learn from experience, and adapt to changing situations.

AI applications are diverse and can be found in areas such as virtual assistants, image and speech recognition, recommendation engines, autonomous vehicles, and healthcare diagnostics. They strive to tackle tasks such as problem-solving, learning, reasoning, perception, and language understanding.

We are far from attributing human intelligence and consciousness to AI, but data networking appears to be key to ML. Kaku (2011) suggested three traits that would be a good start to theorize consciousness in AI:

1. sensing and recognizing the environment
2. self-awareness
3. planning for the future by setting goals and plans, that is, simulating the future and plotting strategy

Accepting these characteristics, it would be useful to examine the role of online data collection in each of them, individually and collectively, in the context of AI.

The purpose of Big Data is to handle and analyze massive volumes of data to derive valuable insights and identify patterns or correlations within the data. It draws on the substantial amount of data that organizations generate, process, and store. Big Data technologies enable organizations to manage and extract value from the datasets to produce meaningful insights, identify patterns, and understand trends that can inform decision-making processes.

Big Data applications span various industries and use cases, including business analytics, financial analysis, healthcare informatics, scientific research, and predictive modeling. Big Data focuses on the efficient handling of large volumes of data that involves data storage, retrieval, processing, and analysis.

Why AI and Big Data Use APIs for Data Collection

An API is a set of rules and tools that allows developers to access the functionality or data of a web service. APIs facilitate Big Data collection and AI machine learning models by providing a communication interface for applications and data networks. APIs allow applications to interact with each other, access external services, and integrate seamlessly into broader systems. Image from [4]
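As a minimal sketch of what "accessing the data of a web service" can look like in practice, the Python below requests a URL and decodes a JSON response. The URL, the field names, and the `fake_opener` stand-in are hypothetical, chosen only so the example runs without a network connection; a real service would define its own endpoints and response format.

```python
import io
import json
from urllib.request import urlopen


def fetch_json(url, opener=urlopen):
    """Request a URL and decode the JSON body into Python objects."""
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fake_opener(url):
    """Hypothetical stand-in for a real web service so the sketch runs offline."""
    body = {"station": "demo-01", "readings": [21.5, 22.0, 21.8]}
    return io.BytesIO(json.dumps(body).encode("utf-8"))


# The URL is a placeholder; with the default opener this would be an HTTP GET.
data = fetch_json("https://api.example.com/v1/readings", opener=fake_opener)
average = sum(data["readings"]) / len(data["readings"])
print(data["station"], round(average, 2))
```

Passing the opener as a parameter keeps the network call separate from the parsing logic, which also makes the function easy to test.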

For example, APIs provided by cloud platforms, such as Google Cloud AI, Microsoft Azure Cognitive Services, and Amazon AI, allow developers to access pre-trained AI models for image recognition, natural language processing, and speech recognition. APIs provided by these platforms enable AI applications to access real-time social media and video streams, including posts, comments, and user interactions.

Many online platforms, including social media, e-commerce, and financial services, offer APIs that enable developers to use machine learning capabilities without managing the underlying infrastructure. Services like Amazon SageMaker, Google Cloud AI, and Azure Machine Learning provide APIs for training, deploying, and managing machine learning models.

Big Data applications use APIs to collect and funnel large volumes of data into comprehensive datasets. Many governments and organizations release datasets publicly as part of open data initiatives. Big Data applications can access these datasets over the Internet to produce classifications based on the input data or make predictions about human behaviors, supporting tasks like urban planning, healthcare analytics, and environmental monitoring.

Likewise, APIs are instrumental in integrating machine learning (ML) models into AI applications. APIs and web scraping can be employed to gather relevant and diverse sets of data from the Internet. For example, web scraping collects images from various sources during image recognition tasks and processes them with Convolutional Neural Networks (CNNs), a type of deep learning architecture that uses algorithms specifically for processing pixel data. CNNs consist of layers with learnable filters (kernels) that detect image patterns like edges, textures, and more complex features. CNNs automatically learn and extract hierarchical features from images that help to identify and recognize objects.
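To make the idea of a filter (kernel) detecting an edge more concrete, here is a small sketch in plain Python. It slides a hand-written 3x3 vertical-edge kernel over a tiny image. In an actual CNN the kernel values are learned during training rather than set by hand, and, as in most deep learning libraries, the operation shown is technically cross-correlation (the kernel is not flipped). The image and kernel values are illustrative assumptions.

```python
def convolve2d(image, kernel):
    """Slide a kernel over a 2D image (valid positions only, no padding)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 5x5 image: dark pixels (0) on the left, bright pixels (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-set kernel that responds where brightness increases from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The feature map is large where the window straddles the dark-to-bright boundary and zero where the window covers a uniform region, which is the sense in which a filter "detects" an edge.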

Many AI and ML platforms provide APIs that allow developers to access pre-trained AI models they can use without extensive training. These are deep learning models trained on large datasets that find patterns or make predictions based on data to accomplish specific tasks. They can be used as is or further fine-tuned to fit an application’s particular needs. These models, often made by Google, Meta, Microsoft and NVIDIA, can perform specific tasks such as creative (art, games, media) workflow, cybersecurity, image recognition, natural language processing, and sentiment analysis.

APIs enable integrating data from diverse sources, allowing Big Data applications to pull data from multiple locations and create a comprehensive dataset. APIs are used for real-time data streaming from sources such as social media platforms, financial markets, or IoT devices. Real-time APIs support continuous data ingestion, allowing Big Data systems to analyze and respond to events as they happen.
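Continuous ingestion can be sketched as a loop that handles each event as it arrives instead of waiting for a complete file. The example below assumes a feed of newline-delimited JSON, a common but by no means universal streaming format; the event fields and the in-memory `feed` list are hypothetical stand-ins for a live connection.

```python
import json
from collections import Counter


def ingest(lines):
    """Yield one parsed event per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical feed; a real one would be read from a socket or HTTP stream.
feed = [
    '{"source": "sensor-a", "type": "temperature", "value": 21.4}',
    '{"source": "sensor-b", "type": "humidity", "value": 48}',
    "",
    '{"source": "sensor-a", "type": "temperature", "value": 21.9}',
]

counts = Counter()
for event in ingest(feed):
    counts[event["type"]] += 1  # update the running tally as each event arrives
    print(event["source"], dict(counts))
```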

Big Data systems often interact with databases to collect structured data. Many databases use APIs to enable programmatic access for querying and retrieving data. This practice is common in scenarios where relational databases or NoSQL databases are part of the data collection process.
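As a rough illustration of programmatic database access, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table name and rows are invented for the example; a production system would connect to its own relational or NoSQL store through that system's driver.

```python
import sqlite3

# An in-memory database stands in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 45.5)],
)

# Query through the database API and collect the result rows into a dict.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()
print(totals)
```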

Cloud providers offer APIs to access their services and resources. Big Data applications can leverage APIs to collect and process data in cloud-based storage and analytics services. This capacity facilitates scalability and flexibility in handling large datasets.

The Internet of Things (IoT) relies on APIs to enable data collection and integration between multiple devices, sensors, and applications. IoT devices collectively generate vast amounts of data that APIs collect and manage. For example, MQTT is a lightweight publish/subscribe messaging protocol designed for low-bandwidth, high-latency, or unreliable networks and is commonly used for real-time communication in IoT environments. Also, RESTful APIs are used for building scalable and stateless web services and for communicating between IoT devices and backend cloud servers. IoT applications requiring data retrieval, updates, and management commonly use APIs to provide a standardized way for AI and Big Data applications to collect data from connected devices, such as in home automation and smart city projects.
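MQTT routes messages by topic, and subscribers can use the wildcards `+` (exactly one topic level) and `#` (all remaining levels). The sketch below imitates that matching with a toy in-memory broker. It is not an MQTT client and does not speak the protocol over a network; `TinyBroker`, the topic names, and the payloads are hypothetical, and the special handling of topics that begin with `$` is left out.

```python
def topic_matches(topic_filter, topic):
    """Return True if an MQTT-style topic filter matches a topic name."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True  # multi-level wildcard matches everything that remains
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


class TinyBroker:
    """Toy in-memory stand-in for a message broker (illustration only)."""

    def __init__(self):
        self.subscriptions = []

    def subscribe(self, topic_filter, callback):
        self.subscriptions.append((topic_filter, callback))

    def publish(self, topic, payload):
        for topic_filter, callback in self.subscriptions:
            if topic_matches(topic_filter, topic):
                callback(topic, payload)


received = []
broker = TinyBroker()
broker.subscribe("home/+/temperature", lambda t, p: received.append((t, p)))
broker.publish("home/kitchen/temperature", "21.5")
broker.publish("home/kitchen/humidity", "40")
print(received)
```

Only the temperature reading reaches the subscriber, since the humidity topic does not match the filter.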

Some companies and services that specialize in aggregating data from various sources offer APIs for accessing their aggregated datasets. Big Data applications can use these APIs to access pre-processed and curated data relevant to their analysis such as aggregated banking data.

AI both guides and uses ETL (Extract, Transform, Load) data aggregation processes. ETL processes often use APIs as part of the extraction phase but also for data transformation and enrichment. For example, ETL data collected from one source may be enriched with additional information from another source using their respective APIs. ETL cleans and organizes raw data and prepares it for data analytics and machine learning in data warehouse environments.
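The extract, transform, and load steps can be sketched in a few lines of Python. Everything here is a simplified assumption: `extract()` returns hard-coded records in place of a real API call, the `COUNTRY_BY_CITY` lookup stands in for a second enrichment source, and an in-memory SQLite table plays the role of the data warehouse.

```python
import sqlite3


def extract():
    """Hypothetical raw records, shaped like a source API might return them."""
    return [
        {"id": "1", "city": " seoul ", "temp_c": "3.5"},
        {"id": "2", "city": "Austin", "temp_c": "18.0"},
        {"id": "3", "city": "austin", "temp_c": ""},  # incomplete reading
    ]


# Stand-in for a second source used to enrich each record.
COUNTRY_BY_CITY = {"Seoul": "KR", "Austin": "US"}


def transform(rows):
    """Clean types and text, drop incomplete rows, add a country code."""
    cleaned = []
    for row in rows:
        if not row["temp_c"]:
            continue
        city = row["city"].strip().title()
        cleaned.append(
            (int(row["id"]), city, COUNTRY_BY_CITY.get(city), float(row["temp_c"]))
        )
    return cleaned


def load(rows, conn):
    """Write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(id INTEGER, city TEXT, country TEXT, temp_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
loaded = conn.execute(
    "SELECT city, country, temp_c FROM readings ORDER BY id"
).fetchall()
print(loaded)
```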

APIs often include mechanisms for authentication and authorization, ensuring that only authorized users or applications can access specific data. This is crucial for maintaining data security and privacy while collecting information for Big Data analysis.
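One common authentication mechanism is a bearer token sent in the `Authorization` header of each request. The sketch below only builds such a request with Python's standard library and does not send it. The URL and token are placeholders, and any given API may use a different scheme (API keys, OAuth flows, signed requests).

```python
from urllib.request import Request


def build_authorized_request(url, token):
    """Attach a bearer token so the service can identify the caller."""
    return Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )


# Placeholder URL and token; real credentials should never be hard-coded.
request = build_authorized_request(
    "https://api.example.com/v1/records", "demo-token"
)
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```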

In summary, APIs provide a standardized and efficient means for Big Data applications to collect data from many sources, ranging from online platforms and databases to IoT devices and cloud services. They enable interoperability between different systems and contribute to the integration of diverse datasets for analysis and decision-making.

How AI and Big Data Use Web Scraping

AI and machine learning (ML) can utilize web scraping as a method for collecting data from websites. They use web scraping for: training datasets and machine learning, text and content analysis, market research, resume parsing, price monitoring, social media monitoring and data aggregation, image and video collection, financial data extraction, healthcare data acquisition, and weather data retrieval.

Natural Language Processing (NLP) models, a subset of AI and ML, benefit from gathering text data for training. Web scraping is used to extract textual content from websites, enabling the creation of datasets for tasks such as sentiment analysis, named entity recognition, or language modeling.
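Extracting textual content from a page mostly means keeping the visible text and discarding markup, scripts, and styles. The following sketch does this with Python's built-in `html.parser`. The sample page is made up, and real sites usually need more careful handling (navigation menus, ads, encoding issues) than shown here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


# Made-up page standing in for downloaded HTML.
page = """<html><head><style>p { color: red; }</style></head>
<body><h1>Product review</h1><p>The battery life is <b>excellent</b></p>
<script>track();</script></body></html>"""

extractor = TextExtractor()
extractor.feed(page)
extractor.close()
text = " ".join(extractor.parts)
print(text)
```

Text gathered this way could then be labeled and added to a training set for a task such as sentiment analysis.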

AI applications involved in market analysis or competitor tracking use web scraping to collect data from competitors’ websites. This data can be analyzed to gain insights into market trends, pricing strategies, and product features. AI applications use web scraping to monitor product prices, availability, and customer reviews from e-commerce websites. This data can inform marketing strategies and enhance recommendation algorithms.

AI-powered recruitment and job matching systems utilize web scraping to extract job postings from various websites. This acquired dataset provides a view of the job market, salary ranges, and in-demand skills. This information can be used to make informed decisions about talent acquisition, workforce planning, and skill development. Additionally, web scraping can be employed to parse resumes and extract relevant information for matching candidates with job opportunities.

AI models that analyze social media trends, sentiments, or user behavior can utilize web scraping to collect data from platforms like X, Facebook, or Instagram. This data is valuable for training models in social media analytics.

Web scraping can gather relevant and diverse datasets of imagery from the web. For image recognition tasks, web scraping can collect graphics and pictures from various sources. AI applications, especially those dealing with computer vision tasks, often use web scraping to collect image and video datasets. This is common in tasks such as object detection, image classification, and facial recognition. Full self-driving (FSD) systems, for example, draw on camera imagery to label potential dangers and obstacles.
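A first step in collecting images is usually to find the image URLs on a page and resolve relative paths against the page address. This is a minimal sketch of that step using the standard library; the page content and base URL are hypothetical, and downloading, deduplicating, and checking usage rights for the images are deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Gather absolute URLs from the src attribute of img tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page address.
                self.image_urls.append(urljoin(self.base_url, src))


# Made-up markup; the third tag has no src and is ignored.
page = '<img src="/photos/cat.jpg" alt="cat"><img src="dog.png"><img alt="none">'

collector = ImageCollector("https://www.example.com/gallery/")
collector.feed(page)
for url in collector.image_urls:
    print(url)
```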

AI and ML models in finance leverage web scraping to collect financial data, news, or market updates from financial websites. This data can be used for predicting financial market trends or making investment decisions.

Some AI applications in healthcare use web scraping to collect medical literature, patient reviews, and information about healthcare providers. This data can be utilized for building models related to healthcare analytics or patient sentiment analysis.

AI models predicting weather patterns may use web scraping to collect real-time weather data from various sources, including weather websites. This data is crucial for training accurate and up-to-date weather prediction models. Web scraping is also economically efficient, allowing many news sources to gather weather information from all over the planet without having to collect it themselves.

Web scraping should be conducted responsibly and ethically, respecting the terms of service of websites and relevant legal regulations. Additionally, websites may have varying degrees of resistance to web scraping, and proper measures should be taken to ensure compliance and minimize any negative impact on the targeted websites.
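One practical form of compliance is to consult a site's robots.txt file before fetching pages and to honor any crawl delay it requests. The sketch below uses Python's `urllib.robotparser` on a made-up robots.txt. The user agent name and URLs are hypothetical, and robots.txt is only one signal: terms of service and applicable law still apply.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real scraper would download the site's own file.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

agent = "research-bot"  # hypothetical user agent name
allowed = parser.can_fetch(agent, "https://www.example.com/articles/1")
blocked = parser.can_fetch(agent, "https://www.example.com/private/data")
delay = parser.crawl_delay(agent)

print("articles allowed:", allowed)
print("private allowed:", blocked)
print("seconds to wait between requests:", delay)
```

A scraper built on this check would skip disallowed paths and sleep for at least the stated delay between requests to limit its load on the site.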

Implications for Net Neutrality

I’m currently reviewing new technologies and devices to consider their implications for broadband policy. These include connected cars as part of my Automatrix series, Virtual Private Networks (VPNs), and Deep Packet Inspection (DPI). I intend to readdress broadband policy issues in light of the FCC’s new emphasis on net neutrality and take a more critical look at content providers. These platforms and websites collect huge amounts of data on human behavior to influence economic and political decisions.[5] It is too early to draw substantive conclusions about the amount of data traffic that AI will produce. Still, I wanted to explain the predominant collection processes and raise some issues.

Net neutrality principles have typically advocated equal treatment of data traffic and regulations restricting ISP discrimination against content providers operating at the Internet’s edge. The Internet and its World Wide Web (WWW) were designed to prioritize capability at the “host” level – the clouds, devices, and platforms at the network’s edges. AI also operates at the edges. Following historical and legal precedents that reach back to the telegraph and even railroads, the regulatory regime for telecommunications has been codified for the carrier to move information commodities and content with transparency and non-interference.

ISPs have pushed back in the computer age, looking to use the increasing intelligence in their telecommunications networks to extract additional value from informational exchanges. They argue the capital-intensive nature of their service provision requires them to invest in the newest technologies. They further contend that their investments can also offer value-added services that would benefit their customers, such as IPTV and search engines. Content competitors have complained this gives the ISPs a competitive and potentially dangerous advantage.

Although it’s early in the era of data collection for AI and Big Data, we can expect that it will have a major impact on network resources. Congestion issues are a major concern for ISPs that risk losing customer confidence if traffic slows, videos buffer, and games lag. Will data collection seriously affect broadband usage? Using APIs and large-scale web scraping, particularly when conducted by big entities, might disproportionately affect network speeds. Those who practice API-based data collection and web scraping should be mindful of their impact on the broader networked world.
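A back-of-the-envelope calculation shows how collection rates translate into sustained bandwidth. The request rates and the 500 KB average response size below are purely illustrative assumptions, not measurements of any actual crawler or network.

```python
def sustained_mbps(requests_per_second, avg_response_kb):
    """Average throughput in megabits per second for a steady request rate."""
    bits_per_second = requests_per_second * avg_response_kb * 1000 * 8
    return bits_per_second / 1_000_000


# Hypothetical request rates, each with an assumed 500 KB average response.
for rate in (1, 100, 10_000):
    print(rate, "req/s ->", sustained_mbps(rate, avg_response_kb=500), "Mbps")
```

Under these assumptions, throughput grows linearly with the request rate, so the network impact depends mainly on how many collectors run at scale and how they pace their requests.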


[1] Pennings, A.J. (2013, Feb 15). Working Big Data – Hadoop and the Transformation of Data Processing; and Pennings, A.J. (2011, Dec 11). The New Frontier of Big Data. Image of web scraping from offering related services.

[2] Data retrieval has historically drawn from the records of structured databases. IBM has made the distinction between structured and unstructured data, where structured data is sourced from “GPS sensors, online forms, network logs, web server logs, OLTP systems, etc., whereas unstructured data sources include email messages, word-processing documents, PDF files, etc.” IBM’s Watson, for example, was heavily dependent on the structured information model in its early days. See Pennings, A.J. (2014, Nov 11). IBM’s Watson AI Targets Healthcare.

[3] AI encompasses various technologies and approaches, including machine learning, neural networks, natural language processing, expert systems, and robotics. Machine learning (ML), a subset of AI, involves algorithms that allow systems to learn from data. Neural networks teach computers to process data with deep learning that uses interconnected nodes or neurons in a layered structure that was inspired by the human brain. Natural language processing is machine learning technology that teaches computers to comprehend, interpret, and manipulate human language. Expert systems use AI to simulate the expertise, judgment, and experience of a human or an organization in a particular field. Robotics is the field of creating intelligent machines that can assist humans in a variety of ways.

[4] Heus, P. (2023, Jun 23). AI, APIs, metadata, and data: the digital knowledge and machine intelligence ecosystem.

[5] Large-scale web scraping often involves the extraction of personal data from websites, and this can raise privacy concerns. If not done responsibly, scraping personal or sensitive information might violate privacy regulations. Net neutrality discussions often extend to privacy considerations, emphasizing the need for responsible and ethical data practices. ISPs might be tempted to intervene in web scraping activities by implementing measures such as blocking or throttling, especially if the scraping activity is seen as detrimental to their networks or if it violates terms of service. Such interventions could raise questions about net neutrality, as they involve discriminatory actions against specific types of traffic.

Note: ChatGPT was used for parts of this post. Multiple prompts were used and parsed.

Citation APA (7th Edition)

Pennings, A.J. (2024, Jan 19). How Do Artificial Intelligence and Big Data Use APIs and Web Scraping to Collect Data? Implications for Net Neutrality.


Anthony J. Pennings, PhD is a Professor at the Department of Technology and Society, State University of New York, Korea, where he teaches broadband and cloud policy for sustainable development. From 2002 to 2012, he was on the faculty of New York University, teaching comparative political economy and digital economics. He also taught in the Digital Media MBA at St. Edward’s University in Austin, Texas, where he lives when not in Korea.

