Reddit packetstream

9/14/2023

I think it's incredibly difficult to build a profitable business in this space. The number of customers who a) need to scrape the web, b) aren't sophisticated enough to do it themselves, and c) are big enough to make $$$ from is small. The important bit is always processing the web pages for whatever content is salient to the given customer, which means that you need to deliver the web pages to them. So effectively your business is providing nothing more than URL lists, and potentially some additional metadata, compared to what the customer would get if they fetched the pages themselves. There are definitely some other complexities you could resolve, but it's hard to imagine that those benefits would be enough to build a business on.

I worked at […] for a while, and that's what we did. In many ways it felt like a satisfying old-skool problem: scrape, find edge case, patch it, scrape again. We were scraping 100 million domains once a week with a team of six engineers, one UX (me), and Ward Fucking Cunningham as wiki expert. Ward in particular was great at prototyping solutions. It is an arms race, since many people don't want you to scrape. We tried hard to respect robots.txt, but we still got angry cease-and-desist emails from people who'd malformed or misconfigured the file.
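To make the robots.txt point concrete, here is a minimal sketch of how a crawler might check a URL before fetching it, using Python's standard urllib.robotparser. The file contents, the bot name example-bot, and the example.com URLs are all made up for illustration; this is not the setup described above. The second file shows one plausible kind of misconfiguration: an empty Disallow line, which permits everything even if the site owner meant to block everything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real crawler would download the file
# from the site instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

# Hypothetical misconfigured file: an empty Disallow value means
# "nothing is disallowed", so a compliant crawler will fetch every page.
MISCONFIGURED_ROBOTS_TXT = """\
User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
misconfigured = build_parser(MISCONFIGURED_ROBOTS_TXT)

print(allowed(parser, "example-bot", "https://example.com/index.html"))
print(allowed(parser, "example-bot", "https://example.com/private/data"))
print(allowed(misconfigured, "example-bot", "https://example.com/private/data"))
```

With these inputs the first check is allowed, the second is blocked, and the third is allowed because of the empty Disallow line. Python's parser applies the first matching rule in file order, while other crawlers may resolve conflicting rules differently, so a file that looks unambiguous to its author can still be read in more than one way.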

I'm a data engineer at heart, and I never did or enjoyed front-end work. Having said that, I was always happy to code and evolve crawlers and web scrapers. Now I've taken some time off from work and gigs, and I'm working on a side project I've been hacking on for some time. Without getting into the details yet: it aims to make web data collection a little bit easier for non-devs. I'll soon have an MVP and will start pitching to investors: aiming for an open-source business model (after a few months of stealth development) and eventually a typical SaaS offering for extra functionality.

At this point I'm trying to consolidate and counter the steel-man counter-arguments I should expect from investors. The most obvious one: as one can imagine, the product is not magic and, after a certain point, it does require some manual work from the customer, hence this is an aspect I should prepare for. I have done some preliminary analysis of the space of potential competitors (think import.io, Apify, Zyte/ScrapingHub, etc.) and described opportunities for differentiation. What I'm afraid of is getting sidetracked in a discussion of "um, this is web scraping and it's hard to make a business on top of it". I understand that there's not much context now and one could easily say "well yeah, anything could be possible with a good team, product.", but I'm reaching out to the HN community to gather some considerations, mental models and pointers I may not think of myself at this point.

> Google, a trillion dollar company, is essentially the world's largest web scraper.

Even just considering the parts of Google that it takes to bring you the N blue links part of the Google SERP, the web scraper is probably the least interesting and significant piece of technology in the stack. It's beyond reductive to say that Google is in essence a large web scraper, or a web scraper of any kind. It is like saying that a person is, in essence, the world's largest mouth.
