Apache programming resources
“Data comes at us fast” is what they say. In fact, the last couple of years have taught us how to successfully cleanse, store, retrieve, process, and visualize large amounts of data in batch or streaming fashion. Despite these advances, data sharing has been severely limited because sharing solutions were tied to a single vendor, did not work for live data, came with severe security issues, and did not scale to the bandwidth of modern cloud object stores. Conferences have been filled for many years with sessions about how to architect applications and master the APIs of your services, but recent events have shown a huge business demand for sharing massive amounts of live data in the most direct, scalable way possible. One example is open data sets of genomic data shared publicly for the development of vaccines. Many commercial use cases, however, share news, financial, or geological data with a restricted audience, where the data has to be secured. In this session, dive deep into an open source solution for sharing massive amounts of live data in a cheap, secure, and scalable way. Delta Sharing is an open source project donated to the Linux Foundation. It uses an open REST protocol to secure the real-time exchange of large data sets, enabling secure data sharing across products for the first time. It leverages modern cloud object stores, such as S3, ADLS, or GCS, to reliably transfer large data sets. There are two parties involved: data providers and recipients. The data provider decides what data to share and runs a sharing server. An open source reference sharing service is available to get started with sharing Apache Parquet or Delta.io tables. Any client supporting pandas, Apache Spark™, Rust, or Python can connect to the sharing server. Clients always read the latest version of the data, and they can provide filters on the data (e.g., “country=ES”) to read a subset of the data. 
Since the data is presented as pandas or Spark DataFrames, integration with ML frameworks such as MLflow or SageMaker is seamless.
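To make the filter idea from the Delta Sharing abstract concrete, here is a minimal Python sketch of how a recipient could assemble a table query for a sharing server. The endpoint path and the `predicateHints` and `limitHint` fields follow our reading of the published Delta Sharing REST protocol; the server URL, the share, schema, and table names, and the helper `build_query_request` are hypothetical. The sketch only builds the request and sends nothing over the network.

```python
import json
from urllib.parse import quote


def build_query_request(endpoint, share, schema, table,
                        predicate_hints=None, limit_hint=None):
    """Return (url, json_body) for a table query; nothing is sent."""
    # Path layout assumed from the Delta Sharing protocol's table query call.
    path = "/shares/{}/schemas/{}/tables/{}/query".format(
        quote(share, safe=""), quote(schema, safe=""), quote(table, safe="")
    )
    body = {}
    if predicate_hints:
        # SQL-like filter strings the server may use to skip data files.
        body["predicateHints"] = list(predicate_hints)
    if limit_hint is not None:
        body["limitHint"] = limit_hint
    return endpoint.rstrip("/") + path, json.dumps(body, sort_keys=True)


url, body = build_query_request(
    "https://sharing.example.com/delta-sharing",  # hypothetical server
    "demo_share", "default", "customers",         # hypothetical names
    predicate_hints=["country = 'ES'"],
    limit_hint=1000,
)
print(url)
print(body)
```

As the name suggests, predicate hints are best-effort: a server may return more rows than the filter selects, so a client would normally still apply the filter to the data it receives.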
As individuals, we use time series data in everyday life all the time. If you’re trying to improve your health, you may track how many steps you take daily and relate that to your body weight or size over time to understand how well you’re doing. This is clearly a small-scale example, but at the other end of the spectrum, large-scale time series use cases abound in our current technological landscape: tracking the price of a stock or cryptocurrency that changes every millisecond, performance and health metrics of a video streaming application, sensors reading temperature, pressure, and humidity, or the information generated by millions of IoT devices. Modern digital applications require collecting, storing, and analyzing time series data at extreme scale, and with performance that a relational database simply cannot provide. We have all seen very creative solutions built to work around this problem, but as throughput needs increase, scaling them becomes a major challenge. To get the job done, developers end up landing, transforming, and moving data around repeatedly, using multiple components pipelined together. Looking at these solutions really feels like looking at Rube Goldberg machines. It’s staggering to see how complex architectures become in order to satisfy the needs of these workloads. Most importantly, all of this has to be built, managed, and maintained, and it still doesn’t meet very high scale and performance needs. Many time series applications can generate enormous volumes of data. One common example is video streaming. Delivering high-quality video content is a very complex process. Understanding load latency, video frame drops, and user activity has to happen at massive scale and in real time. This process alone can generate several GBs of data every second, while easily running hundreds of thousands, sometimes over a million, queries per hour. 
A relational database certainly isn’t the right choice here, which is exactly why we built Timestream at AWS. Timestream started out by decoupling data ingestion, storage, and query so that each can scale independently. The design keeps each subsystem simple, making it easier to achieve unwavering reliability while also eliminating scaling bottlenecks and reducing the chances of correlated system failures, which becomes more important as the system grows. At the same time, in order to manage overall growth, the system is cell based: rather than scaling the system as a whole, we segment it into multiple smaller copies of itself so that these cells can be tested at full scale, and a system problem in one cell can’t affect activity in any of the other cells. In this session, I will introduce the time-series problem, look at some architectures that have been used in the past to work around it, and then introduce Amazon Timestream, a purpose-built database for processing and analyzing time-series data at scale. I will discuss its architecture and demo how it can be used to ingest and process time-series data as a fully managed service, and how easily it integrates with open source tools like Apache Flink or Grafana.
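The cell-based idea described above can be illustrated with a small Python sketch. This is not how Timestream routes traffic internally (the abstract does not say); it is a toy under the assumption that each account is pinned to one cell by a stable hash, so that a problem in one cell only touches the accounts placed there. The names `assign_cell`, `CellRouter`, and the account IDs are hypothetical.

```python
import hashlib


def assign_cell(account_id, num_cells):
    """Pin an account to one cell using a stable hash (illustrative only)."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


class CellRouter:
    """Routes requests to independent cells and tracks unhealthy ones."""

    def __init__(self, num_cells):
        self.num_cells = num_cells
        self.unhealthy = set()

    def mark_unhealthy(self, cell):
        self.unhealthy.add(cell)

    def route(self, account_id):
        cell = assign_cell(account_id, self.num_cells)
        if cell in self.unhealthy:
            raise RuntimeError(f"cell {cell} is unavailable")
        return cell


def affected_accounts(router, accounts):
    """Return the accounts whose cell is currently marked unhealthy."""
    hit = []
    for account in accounts:
        try:
            router.route(account)
        except RuntimeError:
            hit.append(account)
    return hit


router = CellRouter(num_cells=4)
accounts = [f"account-{i}" for i in range(20)]
placement = {a: router.route(a) for a in accounts}

# Simulate a failure in the cell that hosts account-0.
broken = placement["account-0"]
router.mark_unhealthy(broken)
affected = affected_accounts(router, accounts)
print(f"cell {broken} down: {len(affected)} of {len(accounts)} accounts affected")
```

Accounts placed in the other cells keep routing exactly as before, which is the blast-radius property the abstract attributes to a cell-based design.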
CDC (change data capture) is a set of patterns that lets us detect changes in a data source and act on them. In this webinar we look at one reactive CDC implementation based on Debezium, which lets us replicate changes made in a legacy system based on DB2 and Oracle to an Apache Kafka event bus in real time, with the goal of enabling a digital transformation of the current system. Repository: https://github.com/paradigmadigital/debezium Who are the speakers? Jesús Pau de la Cruz: I hold a Computer Engineering degree from Universidad Rey Juan Carlos, and I love technology and the possibilities it offers the world. I am interested in the design of real-time solutions, distributed and scalable architectures, and cloud environments. I currently work at Paradigma as a Software Architect. José Alberto Ruiz Casarrubios: A computer engineer by vocation, a technology all-rounder, and a tireless learner. I am always looking for new challenges where I can try to contribute the best solution. I am fully immersed in the world of software development and systems modernization, and I believe that applying common sense is the best of methodologies and decisions.
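As a rough illustration of the CDC flow in the Debezium webinar above, the following Python sketch applies Debezium-style change events to an in-memory replica. It assumes the standard Debezium event envelope with `op`, `before`, and `after` fields (optionally wrapped in `payload`); the `id` and `status` columns, the sample events, and the helper `apply_change` are hypothetical, and a real pipeline would consume the events from a Kafka topic rather than from a list.

```python
def apply_change(replica, event, key_field="id"):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    # With schemas enabled the envelope is nested under "payload".
    payload = event.get("payload", event)
    op = payload["op"]
    if op in ("c", "r", "u"):      # create, snapshot read, update
        row = payload["after"]
        replica[row[key_field]] = row
    elif op == "d":                # delete: only "before" carries the row
        row = payload["before"]
        replica.pop(row[key_field], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return replica


# Hypothetical change events for an "orders" table, in commit order.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "shipped"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"payload": {"op": "d", "before": {"id": 2, "status": "new"}, "after": None}},
]

replica = {}
for event in events:
    apply_change(replica, event)
print(replica)
```

Applying the events in order matters: the replica only converges to the source state if updates and deletes for a given key are processed in the sequence in which they were committed.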
One of the architectures seeing growing use thanks to the popularity of microservices is Event-Driven Architecture (EDA). Using patterns such as Event Sourcing and Event Collaboration, it decouples microservices and makes them easier to operate. However, just as with synchronous communication, consumers and producers must agree on contracts to guarantee that compatibility is not broken. In this talk, Antón will share his experience building this kind of architecture and, specifically, the problems he has faced when governing those agreements in architectures that span several data centers and different clouds. He will describe the path taken to integrate Kafka, Azure Event Hubs, or Google Pub/Sub using technologies such as Kafka Connect and Google Dataflow. About the speaker (Antón R. Yuste): I’m a Principal Software Engineer focused on Event Streaming and Real-Time Processing. I have experience working with different message brokers and event streaming platforms (Apache Kafka, Apache Pulsar, Google Pub/Sub, and Azure Event Hubs) and real-time processing frameworks (Flink, Kafka Streams, Spark Structured Streaming, Google Dataflow, Azure Stream Analytics, etc.). During my career, I have specialized in building internal SaaS in big corporations to make complex technologies easy for teams to use and adopt, so they can build solutions to real business use cases. I can help from the very beginning with governance, operation, performance, adoption, training, and any task related to system administration or backend development.
This talk examines business perspectives on the Ray Project from RISELab, hailed as a successor to Apache Spark. Ray is a simple-to-use open source library in Python or Java that provides multiple patterns for distributed systems: mix and match as needed for a given business use case, without tight coupling of applications to underlying frameworks. Warning: this talk may change the way your organization approaches AI. #BIGTH20 #RayProject Session presented at Big Things Conference 2020 (Home Edition) by Paco Nathan, Managing Partner at Derwen, on 16th November 2020.
Machine Learning (ML) is separated into model training and model inference. ML frameworks typically use a data lake like HDFS or S3 to process historical data and train analytic models. Model inference and monitoring at production scale in real time is another common challenge when using a data lake. But it’s possible to avoid such a data store completely by using an event streaming architecture. This talk compares the modern approach to traditional batch and big data alternatives and explains benefits like the simplified architecture, the ability to reprocess events in the same order for training different models, and the possibility of building a scalable, mission-critical ML architecture for real-time predictions with far fewer headaches and problems. The talk explains how this can be achieved by leveraging Apache Kafka, Tiered Storage, and TensorFlow. Session presented at Big Things Conference 2020 (Home Edition) by Kai Waehner, Field CTO at Confluent, on 18th November 2020. Do you want to know more? https://www.bigthingsconference.com/
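The point about reprocessing events in the same order for different models can be shown with a toy Python sketch. It uses a plain in-memory append-only log as a stand-in for a retained Kafka topic partition; it does not use the Kafka API, and the class `EventLog`, the two "models" (simple statistics), and the sample values are hypothetical.

```python
class EventLog:
    """Append-only, in-memory stand-in for a retained topic partition."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def replay(self, from_offset=0):
        """Yield (offset, event) pairs in the order they were appended."""
        for offset in range(from_offset, len(self._events)):
            yield offset, self._events[offset]


def train_mean_model(log):
    """Toy 'model' 1: predicts the mean of all observed values."""
    total, count = 0.0, 0
    for _, value in log.replay():
        total += value
        count += 1
    return total / count


def train_last_value_model(log):
    """Toy 'model' 2: predicts the most recent value (order sensitive)."""
    last = None
    for _, value in log.replay():
        last = value
    return last


log = EventLog()
for reading in (10.0, 20.0, 30.0):
    log.append(reading)

# Both trainings read the same events, from offset 0, in the same order.
mean_model = train_mean_model(log)
last_model = train_last_value_model(log)
print(mean_model, last_model)
```

Because each training run starts from an offset of its own choosing, a new model can be trained later against exactly the same history without copying the data into a separate store.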
The vast majority of the content created daily on the Internet is unstructured, and approximately 90% of it is text. In the era of the collaborative web, we use language constantly, for example to write a product review, comment on a photo, or write a tweet. In this talk we will look at some of the tools the Python ecosystem offers to understand, structure, and extract value from text, and we will see how the approach to text processing tasks has evolved in recent years up to the current trend based on transfer learning. We will do so through a concrete use case: detecting offensive comments or insults aimed at other users on social networks or forums. Bio: Rafa Haro currently works as a Search Architect at Copyright Clearance Center. Over more than 14 years of experience in software development, he has worked mainly at companies related to Natural Language Processing, Semantic Technologies, and Intelligent Search. He also actively participates in several open source communities such as the Apache Software Foundation, where he is a committer and PMC member of two projects: Apache Stanbol and Apache Manifold.
In this session, we will demonstrate how common vulnerabilities in the Java and JavaScript ecosystems are exploited on a daily basis by live hacking real-world application libraries. All the examples used are commonly known exploits, some more famous than others, such as the Apache Struts and Spring Break remote code execution vulnerabilities. By exploiting them and showing you how you can be attacked, before showing you how to protect yourself, we will give you a better understanding of why and how a security focus and DevSecOps are essential for every developer. About: Brian Vermeer is a Developer Advocate for Snyk and a Software Engineer with over 10 years of hands-on experience in creating and maintaining software. He is passionate about Java, (Pure) Functional Programming, and Cybersecurity. Brian is an Oracle Groundbreaker Ambassador and a regular international speaker at mostly Java-related conferences like JavaOne, Oracle Code One, Devoxx BE, Devoxx UK, JFokus, JavaZone, and many more. Besides all that, Brian is a military reservist for the Royal Netherlands Air Force and a Taekwondo Master / Teacher.
Processing unbounded streams of data in a distributed system sounds like a challenge. Fortunately, there is a tool that can make your life easier. Łukasz will share his experience as a "Storm Trooper", a user of the Apache Storm framework, announced as the first streaming engine to break the 1-microsecond latency barrier. His story will start by describing the processing model. He'll tell you how to build your distributed application using spouts, bolts, and topologies. Then he'll move on to the components that make your apps work in a distributed way. That's where three players, Nimbus, Supervisors, and ZooKeeper, join in and help build a cluster. Finally, he'll show you a demo app running on Apache Storm. As you know, Storm Troopers are famous for missing targets. Łukasz will sum up the talk by sharing the drawbacks and the ideas that he missed when he first met this technology. After the presentation, you can start playing with stream processing or compare your current approach with the Apache Storm model. And who knows, maybe you'll become a Storm Trooper.
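The spout, bolt, and topology vocabulary above can be previewed with a toy, single-process Python sketch. It is not the Apache Storm API (real topologies are typically wired with Storm's Java `TopologyBuilder` and run across worker processes); here plain generators stand in for the components, and the function names and sample sentences are hypothetical.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source that emits tuples into the topology."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: consumes sentences and emits one word per tuple."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Bolt: keeps a running count per word and returns the final state."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts


def run_topology(sentences):
    """Topology: wires spout -> split bolt -> count bolt."""
    return count_bolt(split_bolt(sentence_spout(sentences)))


counts = run_topology(["the storm is coming", "the storm has passed"])
print(counts.most_common(2))
```

In a real cluster each of these stages would run as many parallel tasks, and the grouping between bolts (for example, sending the same word always to the same counting task) is what keeps per-key state consistent.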
Founded by the creators of Apache Kafka, Confluent has built an event streaming platform that lets companies easily access their data as real-time streams. Confluent Platform is "Apache Kafka on Steroids": built on Apache Kafka, Confluent offers all the functionality needed for a production-grade, mission-critical, and secure deployment. We propose a session introducing the world of event streaming: from its integration capabilities with third-party systems (AWS, Hadoop, Elastic, Mongo, Debezium, MQTT, JMS ...), through its processing capabilities with Kafka Streams and ksqlDB, to its management and deployment both in Kubernetes environments and as SaaS in Confluent Cloud.