Architecture best practices to store data from web scraping and use it in analytics

WeirdMan 220 Reputation points
2024-05-28T08:27:28.46+00:00

I am scraping data from websites and want to export the scraped data as CSV files, store them in Azure Data Lake, and then apply an ETL process; the final output will feed Power BI reports and machine learning. Do I need to use Azure Synapse? Is my architecture good? And if the websites I scrape change their HTML structure, what should I do?

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Machine Learning
An Azure machine learning service for building and deploying models.
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  Amira Bedhiafi 18,501 Reputation points
    2024-05-28T08:35:33.0766667+00:00

    I am breaking your question into three parts:

    1. Do you need to use Azure Synapse?

    Azure Synapse is a powerful tool for large-scale data integration, data warehousing, and big data analytics. However, whether you need it depends on the scale and complexity of your data operations:

    • Use Azure Synapse if:
      • You have large volumes of data that require complex transformations and querying.
      • You need to integrate data from multiple sources, not just your scraped data.
      • You want to leverage Synapse's advanced analytics capabilities, such as integration with Spark and SQL engines.
    • Alternatives:
      • Azure Data Factory (ADF): For orchestrating ETL workflows without the need for a full Synapse environment. ADF can handle data ingestion, transformation, and loading into Azure Data Lake Storage (ADLS).
      • Azure Databricks: For more flexible data processing, especially if you have significant machine learning requirements. Databricks can read/write to ADLS and has strong integration with Azure services.
      • Directly using ADLS: If your data volume is moderate, you can manage transformations with other Azure services or plain Python scripts that read from and write back to ADLS; a minimal sketch follows this list.
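
    For the "directly using ADLS" option, a lightweight Python-only transform could look like the sketch below. The container names (`raw`, `curated`), the file paths, and the `ADLS_CONNECTION_STRING` environment variable are assumptions for illustration; it relies on the azure-storage-file-datalake and pandas packages.

    ```python
    # Minimal sketch: read a raw CSV from ADLS, clean it with pandas, write it back.
    # Container names, paths, and the environment variable are assumptions.
    import io
    import os

    import pandas as pd
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient.from_connection_string(
        os.environ["ADLS_CONNECTION_STRING"]  # assumed to hold the storage connection string
    )

    def transform_csv(raw_path: str, curated_path: str) -> None:
        """Read a raw CSV from the 'raw' zone, apply a simple cleanup, write to 'curated'."""
        raw_fs = service.get_file_system_client("raw")          # hypothetical container
        curated_fs = service.get_file_system_client("curated")  # hypothetical container

        # Download the raw CSV into a DataFrame.
        raw_bytes = raw_fs.get_file_client(raw_path).download_file().readall()
        df = pd.read_csv(io.BytesIO(raw_bytes))

        # Example transformations: drop duplicates, normalise column names.
        df = df.drop_duplicates()
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

        # Write the processed CSV back to the curated zone.
        curated_fs.get_file_client(curated_path).upload_data(
            df.to_csv(index=False).encode("utf-8"), overwrite=True
        )

    if __name__ == "__main__":
        transform_csv("scraped/products_2024-05-28.csv", "products/products_2024-05-28.csv")
    ```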

    2. Is your architecture good?

    Your architecture looks good, with a clear data flow from scraping to storage, transformation, and analysis:

    1. Data Scraping:
      • Scrape data from websites and export it as CSV files.
      • Ensure robust error handling and logging in your scraping scripts; a minimal scrape-and-upload sketch is shown after this list.
    2. Storage:
      • Store the raw CSV files in ADLS.
    3. ETL Process:
      • Use ADF or Azure Databricks to orchestrate the ETL process.
      • Transform raw data as needed and store the processed data back in ADLS.
    4. Analysis:
      • Connect ADLS to Power BI for reporting.
      • Use Azure Databricks or other machine learning services to train models on the transformed data.
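
    To make steps 1 and 2 concrete, here is a minimal scrape-and-upload sketch. The target URL, the CSS selectors, the `raw` container, and the file path are hypothetical, and it assumes the requests, beautifulsoup4, and azure-storage-file-datalake packages.

    ```python
    # Minimal sketch of steps 1-2: scrape, export to CSV, upload the raw file to ADLS.
    # The URL, selectors, container name, and path are assumptions for illustration.
    import csv
    import logging
    from datetime import date

    import requests
    from bs4 import BeautifulSoup
    from azure.storage.filedatalake import DataLakeServiceClient

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("scraper")

    def scrape(url: str) -> list[dict]:
        """Fetch one page and extract rows; log and re-raise on request failures."""
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            logger.exception("Request failed for %s", url)
            raise

        soup = BeautifulSoup(response.text, "html.parser")
        rows = []
        for item in soup.select("div.product"):  # hypothetical selector
            name = item.select_one("h2.name")
            price = item.select_one("span.price")
            if name is None or price is None:
                logger.warning("Unexpected item structure on %s", url)
                continue
            rows.append({"name": name.get_text(strip=True),
                         "price": price.get_text(strip=True)})
        return rows

    def export_and_upload(rows: list[dict], connection_string: str) -> None:
        """Write rows to a local CSV and upload it to the raw zone in ADLS."""
        local_path = f"products_{date.today()}.csv"
        with open(local_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["name", "price"])
            writer.writeheader()
            writer.writerows(rows)

        service = DataLakeServiceClient.from_connection_string(connection_string)
        file_client = service.get_file_system_client("raw").get_file_client(f"scraped/{local_path}")
        with open(local_path, "rb") as f:
            file_client.upload_data(f, overwrite=True)
        logger.info("Uploaded %d rows to ADLS as %s", len(rows), local_path)
    ```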

    3. Handling Changes in Website HTML Structures

    Changes in the HTML structure of websites can break your scraping scripts. Here are some strategies to handle this:

    • Regular Monitoring: Implement monitoring to detect changes in the HTML structure. This can be as simple as periodically checking the page layout or using diff tools to compare HTML structures over time.
    • Modular Scraping Scripts: Write modular and flexible scraping scripts that can be easily updated. Use libraries like BeautifulSoup or Scrapy in Python that can adapt more easily to changes.
    • XPath and CSS Selectors: Prefer selectors anchored on stable attributes (IDs, data attributes, stable class names) over deeply nested positional paths, so that minor layout changes do not break extraction.
    • Error Handling: Implement robust error handling to log failures and notify you when scraping fails due to structural changes.
    • Fallback Mechanisms: Design fallback mechanisms to handle minor changes or missing data gracefully.

    • Automated Testing: Regularly test your scraping scripts against the target websites to catch changes early. A sketch combining several of these ideas follows this list.
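
    Putting several of these ideas together (fallback selectors, graceful handling of missing fields, and loud logging when the page structure looks wrong), a resilient parsing module could look like the sketch below. The field names and selectors are hypothetical and would need to match the actual target sites.

    ```python
    # Minimal sketch of a resilient extractor: each field has an ordered list of
    # candidate selectors (fallbacks), and missing fields are logged rather than
    # crashing the run. Field names and selectors are assumptions.
    import logging

    from bs4 import BeautifulSoup

    logger = logging.getLogger("scraper.parser")

    # Update this mapping when a site changes, instead of rewriting the logic.
    FIELD_SELECTORS = {
        "name":  ["h2.name", "h2.product-title", "[data-testid='product-name']"],
        "price": ["span.price", "div.price-box span", "[data-testid='price']"],
    }

    def extract_field(item, selectors: list[str]) -> str | None:
        """Return the text of the first selector that matches, or None if all fail."""
        for selector in selectors:
            node = item.select_one(selector)
            if node is not None:
                return node.get_text(strip=True)
        return None

    def parse_page(html: str) -> list[dict]:
        soup = BeautifulSoup(html, "html.parser")
        items = soup.select("div.product") or soup.select("li.product-card")  # fallback container
        if not items:
            # Likely a structural change: surface it loudly so monitoring/alerts pick it up.
            logger.error("No product containers found; the page structure may have changed")
            return []

        rows = []
        for item in items:
            row = {field: extract_field(item, selectors)
                   for field, selectors in FIELD_SELECTORS.items()}
            if None in row.values():
                logger.warning("Missing fields %s in one item",
                               [f for f, v in row.items() if v is None])
            rows.append(row)
        return rows
    ```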

    More links:

    https://stackoverflow.com/questions/43849569/scraping-data-from-website-that-could-change

    https://www.linkedin.com/advice/1/how-do-you-monitor-update-your-scraping-scripts-pipelines

    https://www.reddit.com/r/dataengineering/comments/10b7wof/web_scraping_how_to_deal_with_changing_html/

    1 person found this answer helpful.

0 additional answers
