The Web Extractor API is a robust tool designed to retrieve clean, structured text from web pages. Its two specialized endpoints let users extract meaningful content free of ads, navigation elements, and other irrelevant details. In addition, its markdown conversion endpoint turns web pages into structured markdown documents, ideal for blogging, content management, and integration with markdown-based platforms. Compatible with both static and dynamic pages, the API adapts to complex web structures to deliver consistent, high-quality results.
To use this endpoint, send a request with the URL of the web page and receive the clean text extracted from that page's content.
Retrieve Clean Text - Endpoint Features
| Object | Description |
|---|---|
| Request Body | [Required] JSON |
{"response":"Spark Basics\nSuppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\nNow let’s suppose there is heavy compute operation happening in each of the pods. Then there will be certain limit upto which these services can run because unlike horizontal scaling where you can have as many numbers of machines as required, there is limit for vertical scaling because you can’t have unlimited ram and cpu cores for each of the machines in a cluster. Distributed Computing removes this limitation of vertical scaling by distributing the processing across cluster of machines. Now, a group of machines alone is not powerful, you need a framework to coordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, Kubernetes, YARN, or Mesos.\nSpark Basics\nSpark is distributed data processing engine. Distributed data processing in big data is simply series of map and reduce functions which runs across the cluster machines. Given below is python code for calculating the sum of all the even numbers from a given list with the help of map and reduce functions.\nfrom functools import reduce\na = [1,2,3,4,5]\nres = reduce(lambda x,y: x+y, (map(lambda x: x if x%2==0 else 0, a)))\nNow consider, if instead of a simple list, it is a parquet file of size in order of gigabytes. Computation with MapReduce system becomes optimized way of dealing with such problems. In this case spark will load the big parquet file into multiple worker nodes (if the file doesn’t support distributed storage then it will be first loaded into driver node and afterwards, it will get distributed across the worker nodes). Then map function will be executed for each task in each worker node and the final result will fetched with the reduce function.\nSpark timeline\nGoogle was first to introduce large scale distributed computing solution with MapReduce and its own distributed file system i.e., Google File System(GFS). GFS provided a blueprint for the Hadoop File System (HDFS), including the MapReduce implementation as a framework for distributed computing. Apache Hadoop framework was developed consisting of Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN. There were various limitations with Apache Hadoop like it fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries etc. Also the results of the reduce computations were written to a local disk for subsequent stage of operations. Then came the Spark. Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce. It incorporates libraries with composable APIs for machine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).\nSpark Application\nSpark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster. The executors are responsible for actually carrying out the work that the driver assigns them. 
The driver and executors are simply processes, which means that they can live on the same machine or different machines.\nThere is a SparkSession object available to the user, which is the entrance point to running Spark code. When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.\nSpark’s language APIs make it possible for you to run Spark code using various programming languages like Scala, Java, Python, SQL and R.\nSpark has two fundamental sets of APIs: the low-level “unstructured” APIs (RDDs), and the higher-level structured APIs (Dataframes, Datasets).\nSpark Toolsets\nA DataFrame is the most common Structured API and simply represents a table of data with rows and columns. To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in your cluster.\nIf a function returns a Dataframe or Dataset or Resilient Distributed Dataset (RDD) then it is a transformation and if it doesn’t return anything then it’s an action. An action instructs Spark to compute a result from a series of transformations. The simplest action is count.\nTransformation are of types narrow and wide. Narrow transformations are those for which each input partition will contribute to only one output partition. Wide transformation will have input partitions contributing to many output partitions.\nSparks performs a lazy evaluation which means that Spark will wait until the very last moment to execute the graph of computation instructions. This provides immense benefits because Spark can optimize the entire data flow from end to end.\nSpark-submit\nReferences\n- https://spark.apache.org/docs/latest/\n- spark: The Definitive Guide by Bill Chambers and Matei Zaharia"}
curl --location --request POST 'https://zylalabs.com/api/5660/web+extractor+api/7369/retrieve+clean+text' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--data-raw '{
    "url": "https://techtalkverse.com/post/software-development/spark-basics/"
}'
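For reference, here is a minimal Python sketch of the same call using the requests library. The endpoint URL and the YOUR_API_KEY placeholder are taken from the curl example above; the single "response" key is assumed to match the sample response shown, and error handling is kept to a bare minimum.

```python
import requests

API_KEY = "YOUR_API_KEY"  # replace with your Zyla Labs access key
ENDPOINT = "https://zylalabs.com/api/5660/web+extractor+api/7369/retrieve+clean+text"

def retrieve_clean_text(url: str) -> str:
    """Send a page URL to the Retrieve Clean Text endpoint and return the extracted text."""
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": url},
        timeout=30,
    )
    resp.raise_for_status()
    # Per the sample response above, the body is a JSON object with a single "response" key.
    return resp.json()["response"]

if __name__ == "__main__":
    text = retrieve_clean_text("https://techtalkverse.com/post/software-development/spark-basics/")
    print(text[:500])  # preview the first 500 characters of clean text
```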
To use this endpoint, send a request with the URL of the web page and receive that page's content converted to markdown format.
Text Content Extract - Endpoint Features
| Object | Description |
|---|---|
| Request Body | [Required] JSON |
{"response":"---\ntitle: Spark Basics\nurl: https://techtalkverse.com/post/software-development/spark-basics/\nhostname: techtalkverse.com\ndescription: Suppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\nsitename: techtalkverse.com\ndate: 2023-05-01\ncategories: ['post']\n---\n# Spark Basics\n\nSuppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\n\nNow let’s suppose there is heavy compute operation happening in each of the pods. Then there will be certain limit upto which these services can run because unlike horizontal scaling where you can have as many numbers of machines as required, there is limit for vertical scaling because you can’t have unlimited ram and cpu cores for each of the machines in a cluster. **Distributed Computing** removes this limitation of vertical scaling by distributing the processing across cluster of machines.\nNow, a group of machines alone is not powerful, you need a framework to\ncoordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, Kubernetes, YARN, or Mesos.\n\n## Spark Basics\n\nSpark is distributed data processing engine. Distributed data processing in big data is simply series of map and reduce functions which runs across the cluster machines. Given below is python code for calculating the sum of all the even numbers from a given list with the help of map and reduce functions.\n\n```\nfrom functools import reduce\na = [1,2,3,4,5]\nres = reduce(lambda x,y: x+y, (map(lambda x: x if x%2==0 else 0, a)))\n```\n\n\nNow consider, if instead of a simple list, it is a parquet file of size in order of gigabytes. Computation with MapReduce system becomes optimized way of dealing with such problems. In this case spark will load the big parquet file into multiple worker nodes (if the file doesn’t support distributed storage then it will be first loaded into driver node and afterwards, it will get distributed across the worker nodes). Then map function will be executed for each task in each worker node and the final result will fetched with the reduce function.\n\n## Spark timeline\n\nGoogle was first to introduce large scale distributed computing solution with **MapReduce** and its own distributed file system i.e., **Google File System(GFS)**. GFS provided a blueprint for the **Hadoop File System (HDFS)**, including the MapReduce implementation as a framework for distributed computing. **Apache Hadoop** framework was developed consisting of Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN. There were various limitations with Apache Hadoop like it fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries etc. Also the results of the reduce computations were written to a local disk for subsequent stage of operations. Then came the **Spark**. Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce. 
It incorporates libraries with composable APIs for\nmachine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).\n\n## Spark Application\n\n**Spark Applications** consist of a driver process and a set of executor processes. The **driver** process runs your main() function, sits on a node in the cluster. The **executors** are responsible for actually carrying out the work that the driver assigns them. The driver and executors are simply processes, which means that they can live on the same machine or different machines.\n\nThere is a **SparkSession** object available to the user, which is the entrance point to running Spark code. When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.\n**Spark’s language APIs** make it possible for you to run Spark code using various programming languages like Scala, Java, Python, SQL and R.\nSpark has two fundamental sets of APIs: the **low-level “unstructured” APIs** (RDDs), and the **higher-level structured APIs** (Dataframes, Datasets).\n\n## Spark Toolsets\n\nA **DataFrame** is the most common Structured API and simply represents a table of data with rows and columns. To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A **partition** is a collection of rows that sit on one physical machine in your cluster.\n\nIf a function returns a Dataframe or Dataset or Resilient Distributed Dataset (RDD) then it is a **transformation** and if it doesn’t return anything then it’s an **action**. An action instructs Spark to compute a result from a series of transformations. The simplest action is count.\n\nTransformation are of types narrow and wide. **Narrow transformations** are those for which each input partition will contribute to only one output partition. **Wide transformation** will have input partitions contributing to many output partitions.\n\nSparks performs a **lazy evaluation** which means that Spark will wait until the very last moment to execute the graph of computation instructions. This provides immense benefits because Spark can optimize the entire data flow from end to end.\n\n## Spark-submit\n\n## References\n\n- https://spark.apache.org/docs/latest/\n- spark: The Definitive Guide by Bill Chambers and Matei Zaharia"}
curl --location --request POST 'https://zylalabs.com/api/5660/web+extractor+api/7370/text+content+extract' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--data-raw '{
    "url": "https://techtalkverse.com/post/software-development/spark-basics/"
}'
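As with the previous endpoint, a minimal Python sketch of the same markdown request is shown below. It assumes the requests library, reuses the YOUR_API_KEY placeholder from the curl example, and simply writes the returned markdown document to a local file (the file name is illustrative).

```python
import requests

API_KEY = "YOUR_API_KEY"  # replace with your Zyla Labs access key
ENDPOINT = "https://zylalabs.com/api/5660/web+extractor+api/7370/text+content+extract"

def extract_markdown(url: str) -> str:
    """Send a page URL to the Text Content Extract endpoint and return the markdown document."""
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"url": url},
        timeout=30,
    )
    resp.raise_for_status()
    # Per the sample response above, the markdown (front matter plus body) sits under "response".
    return resp.json()["response"]

if __name__ == "__main__":
    markdown = extract_markdown("https://techtalkverse.com/post/software-development/spark-basics/")
    # Save the structured markdown for use in a CMS, blog, or static-site generator.
    with open("spark-basics.md", "w", encoding="utf-8") as f:
        f.write(markdown)
```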
| Header | Description |
|---|---|
| Authorization | [Required] Should be Bearer access_key. See "Your API Access Key" above once you are subscribed. |
No long-term commitment. Upgrade, downgrade, or cancel at any time. The Free Trial includes up to 50 requests.
The Web Extractor API is designed to extract clean, structured text, as well as markdown, from web pages, allowing users to analyze, document, or display content without irrelevant details such as ads or navigation elements.
The API supports both static and dynamic web pages, adapting to complex web structures to ensure consistent, high-quality results during content extraction.
Key features include two specialized endpoints for extracting clean text and converting web pages into structured markdown documents, making it well suited for blogging and content management.
Yes, the Web Extractor API is designed to handle complex web structures, ensuring that it retrieves meaningful content accurately regardless of the page layout.
The markdown conversion endpoint formats the extracted content into structured markdown documents, which can be easily integrated with markdown-based platforms for seamless content management.
The "Retrieve Clean Text" endpoint returns plain text extracted from a web page, while the "Text Content Extract" endpoint returns structured markdown, including metadata such as title, URL, description, and categories.
La respuesta de "Recuperar Texto Limpio" contiene el contenido de texto limpio. La respuesta de "Extracción de Contenido de Texto" incluye campos como título, URL, nombre de host, descripción, nombre del sitio, fecha y categorías, proporcionando un contexto completo para el contenido extraído.
La respuesta "Recuperar Texto Limpio" es un objeto JSON simple con una clave "response" que contiene el texto. La respuesta "Extracción de Contenido de Texto" es un objeto JSON más complejo con múltiples claves, incluyendo metadatos y el contenido en markdown, estructurado para una fácil integración.
El endpoint "Recuperar Texto Limpio" proporciona el contenido textual principal de una página web, mientras que el endpoint "Extraer Contenido de Texto" ofrece tanto el texto como metadatos adicionales, como el título de la página, la URL y las categorías, mejorando el contexto del contenido.
Users can customize requests by specifying different URLs for extraction. The API processes the provided URL and returns the corresponding clean text or markdown, allowing flexibility in content retrieval.
Typical use cases include content analysis, documentation, and blogging. Users can extract clean text for research or convert web pages to markdown for easy integration into content management systems or blogs.
The Web Extractor API uses algorithms to analyze and extract relevant content while filtering out ads and navigation elements, ensuring high accuracy in the extracted data. Continuous updates to the extraction logic help maintain quality.
Users can expect the "Retrieve Clean Text" endpoint to return coherent paragraphs of text, while "Text Content Extract" produces structured markdown with clear metadata. Both outputs are designed to be clean and easy to use in a variety of applications.