I did something similar a while back to the @fesshole Twitter/Bluesky account. Downloaded the entire archive and fine-tuned a model on it to create more unhinged confessions.
Was feeling pretty pleased with myself until I realised that all I’d done was teach an innocent machine about wanking and divorce. Felt like that bit in a sci-fi movie where the alien/super-intelligent AI speed-watches humanity’s history and decides we’re not worth saving after all.
nthingtohide 26 minutes ago [-]
> an innocent machine about wanking and divorce
Let's say you discovered a pendrive from a long-lost civilization and trained a model on that text data. How would you, or the model, know that the pendrive contained data on wanking and divorce without any kind of external grounding for that data?
falcor84 4 hours ago [-]
What's wrong with wanking and divorce? These are respectively a way for people to be happier and more self-reliant, and a way for people to get out of a situation that isn't working out for them. I think both are net positives, and I'm very grateful to live in a society that normalizes them.
dcuthbertson 4 hours ago [-]
The innocent machine can't do either. It's akin to having no mouth, but it must scream (apologies to Harlan Ellison)
falcor84 1 hours ago [-]
That is a fair point, but it would then apply to everything else we teach it about, like how we perceive the color of the sky or the taste of champagne. Should we remove these from the training set too?
Is it not still good to be exposed to the experiences of others, even if one cannot experience these things themself?
montebicyclelo 7 hours ago [-]
There are also two DBs I know of that keep an updated Hacker News table for running analytics on, without needing to download anything first.
- BigQuery, (requires Google Cloud account, querying will be free tier I'd guess) — `bigquery-public-data.hacker_news.full`
- ClickHouse, no signup needed, can run queries in browser directly, [1]
The ClickHouse resource is amazing. It even has history! I had already done my own exercise of downloading all the JSON before discovering the ClickHouse HN DBs.
bambax 6 hours ago [-]
> Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots on it and run them as contributors, slowly and inevitably replacing all human text with the output of a chinese room oscillator perpetually echoing and recycling the past.
The author said this in jest, but I fear someone, someday, will try this; I hope it never happens but if it does, could we stop it?
kriro 6 minutes ago [-]
I think LLMs could be a great driver of public/private-key cryptography. I could see a future where everyone finally wants to sign their content. Then at least we know it's from that person, or from an LLM agent run by that person.
Maybe that'll be a use case for blockchain tech. See the whole posting history of the account on-chain.
icoder 5 hours ago [-]
I'm more and more convinced of an old idea that seems to become more relevant over time: to somehow form a network of trust between humans so that I know that your account is trusted by a person (you) that is trusted by a person (I don't know) [...] that is trusted by a person (that I do know) that is trusted by me.
Lots of issues there to solve, privacy being one (the links don't have to be known to the users, but in a naive approach they are there on the server).
Paths of distrust could be added as negative weight, so I can distrust people directly or indirectly (based on the accounts that they trust) and that lowers the trust value of the chain(s) that link me to them.
Because it's a network, it can adjust itself to people trying to game the system, but it remains a question to how robust it will be.
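To make the weighting idea concrete, here is a minimal sketch (the decay factor, edge weights, and names are all illustrative assumptions, not a worked-out protocol): trust attenuates multiplicatively along each hop, and a direct distrust edge pulls everything it links to below zero.

```python
# Toy web-of-trust: trust decays along chains; distrust adds negative weight.
# All names and weights here are illustrative assumptions, not a real protocol.

def trust_score(edges, source, target, decay=0.5, max_hops=4):
    """Sum, over all simple paths of up to max_hops, the product of edge
    weights attenuated by `decay` per hop. Negative edges model distrust."""
    best = 0.0

    def walk(node, weight, hops, seen):
        nonlocal best
        if hops > max_hops:
            return
        for nxt, w in edges.get(node, []):
            if nxt in seen:
                continue
            if nxt == target:
                best += weight * w * decay ** hops
            else:
                walk(nxt, weight * w, hops + 1, seen | {nxt})

    walk(source, 1.0, 0, {source})
    return best

edges = {
    "me":      [("alice", 1.0), ("mallory", -1.0)],  # trust alice, distrust mallory
    "alice":   [("bob", 0.8)],
    "bob":     [("carol", 0.9)],
    "mallory": [("spambot", 1.0)],
}

print(trust_score(edges, "me", "carol"))    # positive, via me -> alice -> bob -> carol
print(trust_score(edges, "me", "spambot"))  # negative, via the distrust edge
```

Gaming resistance then comes down to how the decay and the negative weights are tuned, which is exactly the open question raised above.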
haswell 2 hours ago [-]
I’ve also been thinking about this quite a bit lately.
I also want something like this for a lightweight social media experience. I’ve been off of the big platforms for years now, but really want a way to share life updates and photos with a group of trusted friends and family.
The more hostile the platforms become, the more viable I think something like this will become, because more and more people are frustrated and willing to put in some work to regain some control of their online experience.
Do these still happen? They were common (-ish, at least in my circles) in the 90s during the crypto wars, often at the end of conferences and events, but I haven't come across them in recent years.
genewitch 4 hours ago [-]
The Matrix protocol - or at least the clients - represents the key as a short sequence of emoji, which is fine, and you verify by looking at the emoji on both clients at the same time, ideally in person. I've only ever signed for people in person, plus one remote attestation; but there we had a separate verified private channel and attested the emoji that way.
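For context, the emoji comparison is a short-authentication-string scheme: both clients derive the same few emoji from the shared session secret, and the humans compare them out of band. A rough sketch of the idea in Python (the emoji table and derivation here are invented for illustration; the real Matrix SAS spec uses a 64-emoji table and an HKDF):

```python
import hashlib

# Illustrative short-authentication-string derivation; NOT the Matrix spec.
EMOJI = ["🐶", "🐱", "🦁", "🐴", "🦄", "🐷", "🐘", "🐰"]  # real spec uses 64

def sas_emoji(shared_secret: bytes, count: int = 7) -> list:
    # Hash the shared secret and map the first `count` bytes to emoji.
    digest = hashlib.sha256(b"SAS|" + shared_secret).digest()
    return [EMOJI[b % len(EMOJI)] for b in digest[:count]]

# Both parties derive from the same secret, so the strings match only if
# no man-in-the-middle substituted keys during the exchange.
alice = sas_emoji(b"shared-ecdh-secret")
bob   = sas_emoji(b"shared-ecdh-secret")
print(alice == bob)  # True
```

The point of the emoji (rather than hex digits) is just that humans are much better at comparing seven animals than forty hex characters.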
brongondwana 41 minutes ago [-]
Also, there's the problem that every human has to have perfect opsec, or you get the problem we have now: massive botnets of compromised home computers.
marcusb 1 hours ago [-]
Isn't this vaguely how the invite system at Lobsters functions? There's a public invite tree, and users risk their reputation (and posting access) when they invite new users.
withinboredom 8 minutes ago [-]
I know exactly zero people over there. I am also not about to go brown nose my way into it via IRC (or whatever chat they are using these days). I'd love to join, someday.
SuperShibe 2 hours ago [-]
I think this idea's problem might be the people part, specifically the majority of people who will click absolutely anything for a free iPad.
XorNot 5 hours ago [-]
I think technically this is the idea that GPG's web of trust was circling without quite landing on, which is the oddest thing about the protocol: today it's used mostly for machine authentication, which it's quite good at (e.g. deb repos)... but the tooling is actually oriented around verifying and trusting people.
wobfan 4 hours ago [-]
Yeah, exactly, that was the idea behind it. Unfortunately, while on paper it sounds like a sound idea (at least IMO), the WOT idea in PGP has proven time and time again to have no chance against the laziness of humans.
drcongo 5 hours ago [-]
I actually built this once, a long time ago for a very bizarre social network project. I visualised it as a mesh where individuals were the points where the threads met, and as someone's trust level rose, it would pull up the trust levels of those directly connected, and to a lesser degree those connected to them - picture a trawler fishing net and lifting one of the points where the threads meet. Similarly, a user whose trust lowered over time would pull their connections down with them. Sadly I never got to see it at the scale it needed to become useful as the project's funding went sideways.
littlestymaar 5 hours ago [-]
Ultimately, guaranteeing common trust between citizens is a fundamental role of the State.
For a mix of ideological reasons and a lack of genuine interest in the internet from legislators (mainly due to the generational factor, I'd guess), it hasn't happened yet, but I expect government-issued equivalents of IDs and passports for the internet to become mainstream sooner rather than later.
eadmund 4 hours ago [-]
> Ultimately, guaranteeing common trust between citizens is a fundamental role of the State.
I don’t think that really follows. Business credit bureaus and Dun & Bradstreet have been privately enabling trust between non-familiar parties for quite a long time. Various networks of merchants did the same in the Middle Ages.
littlestymaar 4 hours ago [-]
> Businesses credit bureaus and Dun & Bradstreet have been privately enabling trust between non-familiar parties for quite a long time.
Under the supervision of the State (they are regulated and rely on the justice and police system to make things work).
> Various networks of merchants did the same in the Middle Ages.
They did, and because there was no State, the amount of trust they could build was fairly limited compared to what has later been made possible by the development of modern states (the industrial revolution appearing in the UK has partly been attributed to the institutional framework that existed there early on).
Private actors can, do, and always have built their own makeshift trust networks, but building a society-wide trust network is a key pillar of what makes modern states “States” (and it derives directly from the “monopoly on violence”).
im3w1l 3 hours ago [-]
GPG lost, TLS won. Both are actually webs of trust built on the same underlying technology, but they have different cultures and so different shapes. GPG culture is to trust your friends and have them trust their friends. With TLS culture, you trust one entity (e.g. your browser vendor) that trusts a couple dozen entities (the root certificate authorities), which either sign keys directly or fan out to intermediate authorities that then sign keys. The hierarchical structure has proven much more successful than the decentralized one.
Frankly I don't trust my friends of friends of friends not to add thirst trap bots.
lxgr 3 hours ago [-]
The difference is in both culture and topology.
TLS (or more accurately, the set of browser-trusted X.509 root CAs) is extremely hierarchical and all-or-nothing.
The PGP web of trust is non-hierarchical and decentralized (from an organizational point of view). That unfortunately makes it both more complex and less predictable, which I suppose is why it “lost” (not that it’s actually gone, but I personally have about one or maybe two trusted, non-expired keys left in my keyring).
kevin_thibedeau 2 hours ago [-]
The issue is key management. TLS doesn't usually require client keys. GPG requires all receivers to have a key.
nashashmi 6 hours ago [-]
LLMs only output the average response of humanity, because they can only give results that are confirmed by multiple sources. On the contrary, many of HN’s comments are quite unique insights that run contrary to average popular thought. If an LLM were to emulate this, it would give only gibberish answers. If we filtered that gibberish to permit only answers that are reasonable and sensible, the answers would be boring and still gibberish. For answers that are precise, accurate and unique, we must use something other than LLMs.
miki123211 6 hours ago [-]
How do you know it isn't already happening?
With long and substantive comments, sure, you can usually tell, though much less so now than a year or two ago. With short, 1 to 2 sentence comments though? I think LLMs are good enough to pass as humans by now.
Joker_vD 4 hours ago [-]
But what if LLMs will start leaving constructive and helpful comments? I personally would feel like xkcd [0], but others may disagree.
I was browsing a Reddit thread recently and noticed that all of the human comments were off-topic one-liners and political quips, as is tradition.
Buried at the bottom of the thread was a helpful reply by an obvious LLM account that answered the original question far better than any of the other comments.
I'm still not sure if that's amazing or terrifying.
gosub100 3 hours ago [-]
That's the moment we will realize that it's not the spam that bothers us, but rather that there is no human interaction. How vapid would it be to have a bunch of fake comments saying eat more vegetables, good job for not running over that animal in the road, call mom tonight it's been a while, etc. They mean nothing if they were generated by a piece of silicon.
withinboredom 4 minutes ago [-]
I believe they mean whatever you mean it to mean. Humanity has existed on religion based on what some dead people wrote down, just fine. Er, well, maybe not "just fine" but hopefully you get the gist: you can attribute whatever meaning you want to the AI, holy text, or other people.
melagonster 2 hours ago [-]
This is just another Reddit or HN.
djoldman 3 hours ago [-]
A variant of this was done for 4chan by the fantastic Yannic Kilcher (the GPT-4chan experiment).
See the Metal Gear franchise [0], the Dead Internet Theory [1], and many others who have predicted this.
> Hideo Kojima's ambitious script in Metal Gear Solid 2 has been praised, some calling it the first example of a postmodern video game, while others have argued that it anticipated concepts such as post-truth politics, fake news, echo chambers and alternative facts.
Perhaps I am jaded, but most if not all people regurgitate opinions on topics without thought or reason, along very predictable paths, myself very much included. You can mention a single word draped over a muleta (the Spanish bullfighter's red cape) and the average person will happily charge at it and give you a predictable response.
bob1029 5 hours ago [-]
It's like a Pavlovian response in me to respond to anything SQL or C# adjacent.
I see the exact same in others. There are some HN usernames that I have memorized because they show up deterministically in these threads. Some are so determined it seems like a dedicated PR team, but I know better...
OccamsMirror 3 hours ago [-]
I always love checking the comments on articles about Bevy to see how the metaverse client guy is going.
gosub100 3 hours ago [-]
The paths are going to be predictable by necessity. It's not possible for everyone to have a uniquely derived interpretation about most common issues, whether that's standard lightning rod politics but also extending somewhat into tech socio/political issues.
r3trohack3r 3 hours ago [-]
HN already has a pretty good immune system for this sort of thing. Low-effort or repetitive comments get down-voted, flagged, and rate-limited fast. The site’s karma and velocity heuristics are crude compared with fancy ML, but they work because the community is tiny relative to Reddit or Twitter and the mods are hands-on. A fleet of sock-puppet LLM accounts would need to consistently clear that bar—i.e. post things people actually find interesting—otherwise they’d be throttled or shadow-killed long before they “replace all human text.”
Even if someone managed to keep a few AI-driven accounts alive, the marginal cost is high. Running inference on dozens of fresh threads 24/7 isn’t free, and keeping the output from slipping into generic SEO sludge is surprisingly hard. (Ask anyone who’s tried to use ChatGPT to farm karma—it reeks after a couple of posts.) Meanwhile the payoff is basically zero: you can’t monetize HN traffic, and karma is a lousy currency for bot-herders.
Could we stop a determined bad actor with resources? Probably, but the countermeasures would look the same as they do now: aggressive rate-limits, harsher newbie caps, human mod review, maybe some stylometry. That’s annoying for legit newcomers but not fatal. At the end of the day HN survives because humans here actually want to read other humans. As soon as commenters start sounding like a stochastic parrot, readers will tune out or flag, and the bots will be talking to themselves.
Written by GPT-3o
no_time 6 hours ago [-]
I can’t think of a solution that preserves the open and anonymous nature that we enjoy now. I think most open internet forums will go one of the following routes:
- ID/proof of human verification. Scan your ID, give me your phone number, rotate your head around while holding up a piece of paper etc. note that some sites already do this by proxy when they whitelist like 5 big email providers they accept for a new account.
- Going invite only. Self explanatory and works quite well to prevent spam, but limits growth. lobste.rs and private trackers come to mind as an example.
- Playing whack-a-mole with spammers (and losing eventually). 4chan does this by requiring you to solve a captcha and to pass the Cloudflare Turnstile, which may or may not do some browser fingerprinting/bot detection. CF is probably pretty good at deanonymizing you through this process too.
All options sound pretty grim to me. I'm not looking forward to the AI spam era of the internet.
theasisa 5 hours ago [-]
Wouldn't those only mean that the account was initially created by a human? Afterwards there are no guarantees that the posts are by humans.
You'd need a permanent captcha that tracks whether the actions you perform are human-like, such as mouse movement or scrolling on a phone, etc. And even then it would only deter current AI bots, and not for long, as impersonating human behavior would be a 'fun' challenge to break.
Trusted relationships are only as trustworthy as the humans trusting each other, eventually someone would break that trust and afterwards it would be bots trusting bots.
Bots are already filling up social media with their spew, and that spew is being used to train other bots, so the only way I see this resolving itself is by everything eventually becoming nonsensical; I predict we aren't that far from it happening. AI will eat itself.
no_time 4 hours ago [-]
>Wouldn't those only mean that the account was initially created by a human but afterwards there are no guarantees that the posts are by humans.
Correct. But for curbing AI slop comments this is enough imo. As of writing this, you can quite easily spot LLM generated comments and ban them. If you have a verification system in place then you banned the human too, meaning you put a stop to their spamming.
icoder 5 hours ago [-]
I'm sometimes thinking about account verification that requires work/effort over time, could be something fun even, so that it becomes a lot harder to verify a whole army of them. We don't need identification per se, just being human and (somewhat) unique.
See also my other comment on the same parent wrt a network of trust. That could perhaps vet out spammers and trolls. On one hand it seems far-fetched and a quite underdeveloped idea; on the other hand, social interaction (including discussions like these) as we know it is in serious danger.
dns_snek 5 hours ago [-]
There must be a technical solution to this based on some cryptographic black magic: something that verifies you to a given website as a unique person without divulging your identity, and without creating a globally unique identifier that would make it easy to track us across the web.
Of course this goes against the interests of tracking/spying industry and increasingly authoritarian governments, so it's unlikely to ever happen.
vvillena 4 hours ago [-]
These kinds of solutions are already deployed in some places. A trusted ID server creates a bunch of anonymous keys for a person, and the person uses these keys to identify themselves to pages that accept the ID server's keys. The page has no way to identify the person from a key.
The weak link is in the ID servers themselves. What happens if the servers go down, or if they refuse to issue keys? Think a government ID server refusing to issue keys for a specific person. Pages that only accept keys from these government ID servers, or that are forced to only accept those keys, would be inaccessible to these people. The right to ID would have to be enshrined into law.
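One concrete construction behind such schemes is the blind signature: the ID server signs a token it never sees in unblinded form, so a site can verify the token came from the server without the server being able to link it back to the person. A toy RSA version in Python (tiny primes, no padding; purely illustrative, not secure):

```python
import hashlib
import secrets
from math import gcd

# Toy RSA blind signature. Parameters are far too small for real use.
p, q = 1000003, 1000033            # toy primes
n, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))  # the ID server's private exponent

def blind(msg: bytes):
    # Client hashes its token and multiplies by r^e to hide it.
    m = int.from_bytes(hashlib.sha256(msg).digest(), "big") % n
    while True:
        r = secrets.randbelow(n - 2) + 2
        if gcd(r, n) == 1:
            break
    return m, r, (m * pow(r, e, n)) % n  # blinded value sent to the server

def server_sign(blinded: int) -> int:
    # The server signs without ever seeing m itself.
    return pow(blinded, d, n)

def unblind(s_blind: int, r: int) -> int:
    # (m * r^e)^d = m^d * r, so dividing by r recovers the signature on m.
    return (s_blind * pow(r, -1, n)) % n

def verify(msg: bytes, s: int) -> bool:
    m = int.from_bytes(hashlib.sha256(msg).digest(), "big") % n
    return pow(s, e, n) == m

m, r, blinded = blind(b"anonymous-login-token")
sig = unblind(server_sign(blinded), r)
print(verify(b"anonymous-login-token", sig))  # True
```

The weak-link concern above survives intact: if the server refuses to run `server_sign` for you, no amount of cryptography gets you a token.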
no_time 3 hours ago [-]
As I see it, a technical solution to AI spam inherently must include a way to uniquely identify particular machines at best, and particular humans responsible for said machines at worst.
This verification mechanism must include some sort of UUID to rein in a single bad actor who happens to validate his/her bot farm of 10000 accounts from the same certificate.
05 5 hours ago [-]
Oh you mean something like Apple's Private Access Tokens?
I don't think that's what I was going for? As far as I can see it relies on a locked down software stack to "prove" that the user is running blessed software on top of blessed hardware. That's one way of dealing with bots but I'm looking for a solution that doesn't lock us out of our own devices.
dangoodmanUT 1 hours ago [-]
I imagine LLMs already have this too
ahoka 6 hours ago [-]
Probably already happening.
genewitch 4 hours ago [-]
I have all of n-gate as json with the cross references cross referenced.
Just in case I need to check for plagiarism.
I don't have enough Vram nor enough time to do anything useful on my personal computer. And yes I wrote vram like that to pothole any EE.
_Algernon_ 6 hours ago [-]
This is probably already happening to some extent. I think the best we can hope for is xkcd 810: https://xkcd.com/810/
drcongo 5 hours ago [-]
The internet is going to become like William Basinski's Disintegration Loops, regurgitating itself with worse fidelity until it's all just unintelligible noise.
userbinator 7 hours ago [-]
> I had a 20 GiB JSON file of everything that has ever happened on Hacker News
I'm actually surprised at that volume, given this is a text-only site. Humans have managed to post over 20 billion bytes of text to it over the 18 years that HN has existed? That averages to over 2 MB per day, or around 38 bytes per second.
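Back-of-the-envelope check on that rate (20 GiB spread over 18 years comes out to a bit over 3 MB per day, so "over 2 MB" holds, at a few dozen bytes per second):

```python
# Average posting rate implied by 20 GiB of JSON over 18 years of HN.
total_bytes = 20 * 2**30                  # 20 GiB
seconds = 18 * 365.25 * 24 * 3600         # ~18 years
per_sec = total_bytes / seconds
per_day = per_sec * 86400
print(f"{per_day / 1e6:.1f} MB/day, {per_sec:.0f} bytes/sec")
```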
sph 7 hours ago [-]
2 MB per day doesn't sound like a lot. The number of posts has probably increased exponentially over the years, especially after the Reddit fiasco, when we had our latest and biggest neverending September.
Also, I bet a decent amount of that is not from humans. /newest is full of bot spam.
samplatt 6 hours ago [-]
Plus the JSON structure metadata, which for the average comment is going to add, what, 10%?
kevincox 5 hours ago [-]
I suspect it is closer to a 100% increase for the average comment. If the average comment is a few sentences and the metadata has an id, parent id, author, timestamp and a vote count, that can add up pretty fast.
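A quick way to sanity-check that estimate is to measure one representative item, using the field names from the public HN API (`id`, `type`, `by`, `time`, `parent`, `kids`, `text`); the comment text and values below are made up:

```python
import json

# Rough JSON-metadata overhead for a typical short comment.
comment = {
    "id": 44000001,
    "type": "comment",
    "by": "someuser",
    "time": 1715000000,
    "parent": 43999999,
    "kids": [44000002, 44000003],
    "text": "If the average comment is a few sentences, the metadata "
            "can easily rival the text itself in size.",
}
total = len(json.dumps(comment))
text_only = len(comment["text"])
overhead = 100 * (total - text_only) / text_only
print(f"{total} bytes total, {text_only} bytes of text, {overhead:.0f}% overhead")
```

For short comments the overhead lands well above 100%, which supports the "closer to 100% increase" guess.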
FabHK 6 hours ago [-]
Around one book every 12 hours.
xnx 2 hours ago [-]
20 GB JSON is surprising to me. I have an sqlite file of all HN data that is 20 GB, it would be much larger as JSON.
jakegmaths 10 hours ago [-]
Your query for Java will also match every instance of JavaScript, so you're over-representing Java.
smarnach 9 hours ago [-]
Similarly, the Rust query will include "trust", "antitrust", "frustration" and a bunch of other words
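The usual fix is a word-boundary match instead of a substring match. A small Python illustration with made-up titles (the same `\b` trick works in ClickHouse's RE2-based `match()`):

```python
import re

titles = [
    "Show HN: A JavaScript framework",
    "Java 21 released",
    "Why I distrust antitrust frustration",  # no actual Rust content
    "Rust 1.70 is out",
]

def mentions(word: str, text: str) -> bool:
    # \b keeps "Java" from matching inside "JavaScript",
    # and "Rust" from matching inside "antitrust" or "frustration".
    return re.search(rf"\b{re.escape(word)}\b", text) is not None

print([t for t in titles if mentions("Java", t)])
print([t for t in titles if mentions("Rust", t)])
```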
sph 7 hours ago [-]
A guerilla marketing plan for a new language is to name it after a common one-syllable word, so that it appears much more prominent than it really is in badly-done popularity contests.
Call it "Go", for example.
(Necessary disclaimer for the irony-impaired: this is a joke and an attempt at being witty.)
setopt 6 hours ago [-]
Let’s make a language called “A” in that case. (I mean C was fine, so why not one letter?)
InDubioProRubio 7 hours ago [-]
You could also hijack an overloaded acronym to boost your mental presence among gamers, LOL.
Ah right… maybe even more unexpected, then, to see a decline.
cs02rm0 7 hours ago [-]
I'm not so sure; while Java's never looked better to me, it does "feel" to be in significant decline in terms of what people are asking for on LinkedIn.
I'd imagine these days typescript or node might be taking over some of what would have hit on javascript.
karel-3d 1 hours ago [-]
New Java actually looks good, but most of the actual Java ecosystem is stuck in the past... and you will mostly work within the existing ecosystem.
cess11 4 hours ago [-]
Recruiting Java developers is easy mode; there are rather large consultancies and similar suppliers that will sell or rent them to you in bulk, so you don't need to nag with adverts to the same extent as with Pythonistas, Rubyists and TypeScript developers.
But there is likely some decline for Java. I'd bet Elixir and Erlang have been nibbling away on the JVM space for quite some time, they make it pretty comfortable to build the kind of systems you'd otherwise use a JVM-JMS-Wildfly/JBoss rig for. Oracle doesn't help, they take zero issue with being widely perceived as nasty and it takes a bit of courage and knowledge to manage to avoid getting a call from them at your inconvenience.
patates 3 hours ago [-]
Speaking as someone who ended up in the corporate Java world somewhat accidentally (wasn't deep in the ecosystem before): even the most invested Java shops seem wary of Oracle's influence now. Questioning Oracle tech, if not outright planning an exit strategy, feels like the default stance.
SilverBirch 6 hours ago [-]
What is the netiquette of downloading HN? Do you ping Dang and ask him before you blow up his servers? Or do you just assume at this point that every billion dollar tech company is doing this many times over so you probably won't even be noticed?
euroderf 6 hours ago [-]
Not to mention three-letter agencies, incidentally attaching real names to HN monikers ?
alt227 2 hours ago [-]
If something is on the public web, it is already being scraped by thousands of bots.
krapp 5 hours ago [-]
HN has an API, as mentioned in the article, which isn't even rate limited. And all of the data is hosted on Firebase, which is a YC company. It's fine.
mikeevans 2 hours ago [-]
Firebase is owned and operated by Google (has been for a while).
dangoodmanUT 1 hours ago [-]
there's literally an API they promote. Did you read that part before trying to cancel them?
flakiness 9 hours ago [-]
I have done something similar. I cheated by using the BigQuery dataset (which somehow keeps getting updated), exporting the data to parquet, downloading it, and querying it using duckdb.
minimaxir 9 hours ago [-]
That's not cheating, that's just pragmatic.
AbstractH24 2 hours ago [-]
What a pragmatic way to rationalize most cheating
xnx 1 hours ago [-]
I have this data and a bunch of interesting analysis to share. Any suggestions on the best method to share results?
I like Tableau Public, because it allows for interactivity and exploration, but it can't handle this many rows of data.
Is there a good tool for making charts directly from Clickhouse data?
texodus 54 minutes ago [-]
No Clickhouse connector for free accounts yet, but if you can drop a Parquet file on S3 you can try https://prospective.co
I wrote one a while back https://github.com/ashish01/hn-data-dumps and it was a lot of fun. One thing which will be cool to implement: more recent items receive more updates over time, so recently downloaded items go stale faster than older ones.
jasonthorsness 10 hours ago [-]
Yeah I’m really happy HN offers an API like this instead of locking things down like a bunch of other sites…
I used a function based on age for staleness: it considers things stale after a minute or two initially, and immutable once they're about two weeks old.
// DefaultStaleIf marks stale at 60 seconds after creation, then frequently for the first few days after an item is
// created, then quickly tapers after the first week to never again mark stale items more than a few weeks old.
const DefaultStaleIf = "(:now-refreshed)>" +
"(60.0*(log2(max(0.0,((:now-Time)/60.0))+1.0)+pow(((:now-Time)/(24.0*60.0*60.0)),3)))"
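Translating that expression into plain code (my reading of it, with age = :now - Time in seconds) shows the taper: roughly a minute of freshness for brand-new items, with the cubic term making anything older than a couple of weeks effectively immutable.

```python
from math import log2

def stale_after(age_seconds: float) -> float:
    """Seconds after the last refresh at which an item of this age goes
    stale, per the expression above: logarithmic early, cubic later."""
    days = age_seconds / 86400
    return 60.0 * (log2(max(0.0, age_seconds / 60.0) + 1.0) + days ** 3)

for label, age in [("1 minute", 60), ("1 hour", 3600),
                   ("1 day", 86400), ("1 week", 7 * 86400),
                   ("2 weeks", 14 * 86400)]:
    print(f"{label:>8}: refresh if older than {stale_after(age) / 60:.1f} min")
```

A minute-old item goes stale after about a minute; a two-week-old item only after roughly two days, and the threshold keeps growing cubically from there.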
Hah, I've been scraping HN over the past couple weeks to do something similar! Only submissions though, not comments. It was after I went to /newest and was faced with roughly 9/10 posts being AI-related. I was curious what the actual percentage of posts on HN were about AI, and also how it compared to other things heavily hyped in the past like Web3 and crypto.
sebastianmestre 2 hours ago [-]
Can you remake the stacked graphs with the variable of interest at the bottom? It's hard to see the percentage of Rust when it's all the way at the top with a lot of noise in the lower layers.
Edit: or make a non-stacked version?
jasonthorsness 1 hours ago [-]
Lots of valid criticism here of these graphs and the queries; I'll write a follow-up article.
stefs 7 hours ago [-]
please do not use stacked charts! i think it's close to impossible not to distort the reader's impression, because a) it's very hard to gauge the height of a certain data point amid the noise and b) they imply a dependency where there _probably_ is none.
jasonthorsness 1 hours ago [-]
It's true :( but line charts of the data had too much overlap and were hard to see anything. I was thinking next time maybe multiple line charts aligned and stacked, with one series per region?
What is this even supposed to represent? The entire justification I could give for stacked bars is that you could permute the sub-bars and obtain comparable results. Do the bars still represent additive terms? Multiplicative constants? As a non-physicist I would have no idea on how to interpret this.
dguest 5 hours ago [-]
It's a histogram. Each color is a different simulated physical process: they can all happen in particle collisions, so the sum of all of them should add up to the data the experiment takes. The data isn't shown here because it hasn't been taken yet: this is an extrapolation to a future dataset. And the dotted lines are some hypothetical signal.
The area occupied by each color is basically meaningless, though, because of the logarithmic y-scale. It always looks like there's way more of whatever you put on the bottom. And obviously you can grow it without bound: if you move the lower y-limit to 1e-20 you'll have the whole plot dominated by whatever is on the bottom.
For the record I think it's a terrible convention, it just somehow became standard in some fields.
9rx 8 hours ago [-]
> The Rise Of Rust
Shouldn't that be The Fall Of Rust? According to this, it saw the most attention during the years before it was created!
emilbratt 8 hours ago [-]
The chart is a stacked one, so we are looking at the height each category takes up, not the height each category reaches.
matsemann 9 hours ago [-]
One thing I'm curious about, but I guess not visible in any way, is random stats about my own user/usage of the site. What's my upvote/downvote ratio? Are there users I constantly upvote/downvote? Who is liking/hating my comments the most? And some I guessed could be scrapable: which days/times am I the most active (like the GitHub green grid thingy)? How has my activity changed over the years?
pjc50 5 hours ago [-]
I don't think you can get the individual vote interactions, and that's probably a good thing. It is irritating that the "API" won't let me get vote counts; I should go back to my Python scraper of the comments page, since that's the only way to get data on post scores.
I've probably written over 50k words on here and was wondering if I could restructure my best comments into a long meta-commentary on what does well here and what I've learned about what the audience likes and dislikes.
(HN does not like jokes, but you can get away with it if you also include an explanation)
minimaxir 9 hours ago [-]
The only vote data that is visible via any HN API is the scores on submissions.
Day/Hour activity maps for a given user are relatively trivial to do in a single query, but only public submission/comment data could be used to infer it.
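As a sketch, with nothing but public item timestamps (the `time` field in the HN API is Unix epoch seconds; the timestamps below are made up), the day/hour map is a small aggregation:

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical comment timestamps for one user (HN API `time` field).
timestamps = [1715000000, 1715003600, 1715090400, 1715518800]

# Bucket each item by (weekday, hour) in UTC.
buckets = Counter(
    (dt.strftime("%a"), dt.hour)
    for dt in (datetime.fromtimestamp(t, tz=timezone.utc) for t in timestamps)
)
for (day, hour), n in sorted(buckets.items(), key=lambda kv: -kv[1]):
    print(f"{day} {hour:02d}:00 UTC: {n} item(s)")
```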
ryandrake 9 hours ago [-]
Too bad! I’ve always sort of wanted to be able to query things like what were my most upvoted and downvoted comments, how often are my comments flagged, and so on.
saagarjha 8 hours ago [-]
I did this once by scraping the site (very slowly, to be nice). It’s not that hard since the HTML is pretty consistent.
xnx 2 hours ago [-]
Some of this data is available through the API (and Clickhouse and BigQuery).
I wrote a Puppeteer script to export my own data that isn't public (upvotes, downvotes, etc.)
nottorp 8 hours ago [-]
> Are there users I constantly upvote/downvote?
Hmm. Personally I never look at user names when I comment on something. It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...
vidarh 6 hours ago [-]
The exception, to me, is if I'm questioning whether the comment was in good faith or not, where the trackrecord of the user on a given topic could go some way to untangle that. It happens rarely here, compared to e.g. Reddit, but sometimes it's mildly useful.
pjc50 5 hours ago [-]
I recognize twenty or so of the most frequent and/or annoying posters.
The leaderboard https://news.ycombinator.com/leaders absolutely doesn't correlate with posting frequency. Which is probably a good thing. You can't bang out good posts non-stop on every subject.
matsemann 8 hours ago [-]
Same, which is why it would be cool to see. Perhaps there are people I both upvote and downvote?
thaumasiotes 7 hours ago [-]
> It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...
...is that supposed to pose some kind of problem? The problem would be in the other direction, surely?
nottorp 4 hours ago [-]
Either you got the direction wrong or you'd support someone who is wrong just because you like them.
You're wrong in both cases :)
thaumasiotes 3 hours ago [-]
Maybe try rereading my comment?
nottorp 42 minutes ago [-]
You're right. But I still disagree with you. Both ways are wrong if you want to maintain a constructive discussion.
Maybe you don't like my opinions on cogwheel shaving but you will agree with me on quantum frobnicators. But if you first come across my comments on cogwheel shaving and note the user name, you may not even read the comments on quantum frobnicators later.
9rx 8 hours ago [-]
> What's my upvote/downvote ratio?
Undefined, presumably. What reason would there be to take time out of your day to press a pointless button?
It doesn't communicate anything other than that you pressed a button. For someone participating in good faith, that doesn't add any value. But for those not participating in good faith, i.e. trolls, it adds incredible value to know that their trolling is being seen. So it is actually a net negative to the community if you did somehow accidentally press one of those buttons.
For those who seek fidget toys, there are better devices for that.
immibis 8 hours ago [-]
Actually, its most useful purpose is to hide opinions you disagree with - if 3 other people agree with you.
Like when someone says GUIs are better than CLIs, or C++ is better than Rust, or you don't need microservices, you can just hide that inconvenient truth from the masses.
9rx 8 hours ago [-]
So, what you are saying is that if the masses agree that some opinion is disagreeable, they will hide it from themselves? But they already read it to know it was disagreeable, so... What are they hiding it for, exactly? So that they don't have to read it again when they revisit the same comments 10 years later? Does anyone actually go back and reread the comments from 10 years ago?
jpc0 5 hours ago [-]
It’s not so much rereading the comments but more a matter of it being indication to other users.
Take the C++ example above: you are likely to be downvoted for supporting C++ over Rust, and therefore most people reading through the comments (and LLMs correlating comment “karma” with how liked a comment is) will come away associating Rust > C++, which isn’t a nuanced opinion at all and IMHO is just plain wrong a decent amount of the time. They are tools and have their uses.
So generally it shows the sentiment of the group, and humans are conditioned to follow the group.
9rx 2 hours ago [-]
An indication of what? It is impossible to know why a user pressed an arrow button. Any meaning the user may have wanted to convey remains their own private information.
All it can fundamentally serve is to act as an impoverished man's read receipt. And why would you want to give trolls that information? Fishing to find out if anyone is reading what they're posting is their whole game. Do not feed the trolls, as they say.
matsemann 8 hours ago [-]
Since there are no rules on downvoting, people probably use it for different things. Some use it to show dissent, others only to downvote things they think don't belong, etc. Which is why it would be interesting to see. Am I overusing it compared to the community? Underusing it?
saagarjha 8 hours ago [-]
If Hacker News had reactions I’d put an eye roll here.
9rx 8 hours ago [-]
You could have assigned 'eye roll' to one of the arrow buttons! Nobody else would have been able to infer your intent, but if you are pressing the arrow buttons it is not like you want anyone else to understand your intent anyway.
tacker2000 7 hours ago [-]
Yea, i also get the feeling that these rust evangelists get more annoying every day ;p
4 hours ago [-]
deadbabe 4 hours ago [-]
Is the 20GB JSON file available?
Am4TIfIsER0ppos 1 hours ago [-]
I hope they snatched my flagged comments. I would be pleased to have helped make the AI into an asshole. Here's hoping for another Tay AI.
hsbauauvhabzb 8 hours ago [-]
Is the raw dataset available anywhere? I really don’t like the HN search function, and grepping through the data would be handy.
Havoc 7 hours ago [-]
It’s on Firebase/BigQuery to avoid people doing what OP did.
If you click the api link bottom of page it’ll explain.
I had to CTRL-C and resume a few times when it stalled; it might be a bug in my tool
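Resumability after a Ctrl-C is cheap to build in: persist the next id to fetch after each item, so a restarted run picks up where it stopped. A hypothetical sketch (the `fetch` callable and checkpoint format are assumptions, not how the linked tool actually works):

```python
import json
import os

def download_range(fetch, start_id, end_id, checkpoint="progress.json"):
    """Fetch items start_id..end_id, persisting the next id so an
    interrupted run can resume instead of starting over."""
    next_id = start_id
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            next_id = json.load(f)["next_id"]
    items = []
    for item_id in range(next_id, end_id + 1):
        items.append(fetch(item_id))
        with open(checkpoint, "w") as f:  # record progress after each item
            json.dump({"next_id": item_id + 1}, f)
    return items
```

Writing the checkpoint per item is wasteful at scale; batching it every few hundred items would be the obvious refinement.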
xnx 2 hours ago [-]
Is there any advantage to making all these requests instead of using ClickHouse or BigQuery?
jasonthorsness 1 hours ago [-]
Probably not :P. I made the client for another project, https://hn.unlurker.com, and then just jumped straight to using it to download the whole thing instead of searching for an already available full data set.
andrewshadura 8 hours ago [-]
Funny nobody's mentioned "correct horse battery staple" in the comments yet…
pier25 9 hours ago [-]
would love to see the graph of React, Vue, Angular, and Svelte
hellostgeroge 5 hours ago [-]
[dead]
a3w 4 hours ago [-]
Cool project. Cool graphs.
But any GDPR requests for info and deletion in your inbox, yet?
[1] https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...
The author said this in jest, but I fear someone, someday, will try this; I hope it never happens but if it does, could we stop it?
Maybe that'll be a use case for blockchain tech. See the whole posting history of the account on-chain.
Lots of issues there to solve, privacy being one (the links don't have to be known to the users, but in a naive approach they are there on the server).
Paths of distrust could be added as negative weight, so I can distrust people directly or indirectly (based on the accounts that they trust) and that lowers the trust value of the chain(s) that link me to them.
Because it's a network, it can adjust itself to people trying to game the system, but it remains a question to how robust it will be.
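One way to make the idea concrete: score a chain as the product of its edge weights, with distrust as a negative weight, and take the best-scoring simple path. A toy sketch (the multiplicative scoring rule is my own assumption, not a worked-out design):

```python
def chain_trust(edges, src, dst):
    """Best trust from src to dst: maximize the product of edge weights
    along a simple path. Weights lie in [-1, 1]; negative = distrust,
    so any distrusted hop flips the sign of the whole chain."""
    best = None

    def dfs(node, product, seen):
        nonlocal best
        if node == dst:
            if best is None or product > best:
                best = product
            return
        for neighbor, weight in edges.get(node, []):
            if neighbor not in seen:
                dfs(neighbor, product * weight, seen | {neighbor})

    dfs(src, 1.0, {src})
    return best  # None if no chain connects the two accounts
```

Exhaustive path search obviously doesn't scale past toy graphs; a real system would need something like bounded-depth propagation.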
I also want something like this for a lightweight social media experience. I’ve been off of the big platforms for years now, but really want a way to share life updates and photos with a group of trusted friends and family.
The more hostile the platforms become, the more viable I think something like this will become, because more and more people are frustrated and willing to put in some work to regain some control of their online experience.
For a mix of ideological reasons and a lack of genuine interest in the internet from legislators (mainly generational, I'd guess), it hasn't happened yet, but I expect government-issued equivalents of IDs and passports for the internet to become mainstream sooner rather than later.
I don’t think that really follows. Business credit bureaus like Dun & Bradstreet have been privately enabling trust between unfamiliar parties for quite a long time. Various networks of merchants did the same in the Middle Ages.
Under the supervision of the State (they are regulated and rely on the justice and police system to make things work).
> Various networks of merchants did the same in the Middle Ages.
They did, and because there was no State the amount of trust they could build was fairly limited compared to what was later made possible by the development of modern states (the Industrial Revolution appearing in the UK has partly been attributed to the institutional framework that existed there early).
Private actors can, do, and always have built their own makeshift trust networks, but building a society-wide trust network is a key pillar of what makes modern states “States” (and it derives directly from the “monopoly on violence”).
Frankly I don't trust my friends of friends of friends not to add thirst trap bots.
TLS (or more accurately, the set of browser-trusted X.509 root CAs) is extremely hierarchical and all-or-nothing.
The PGP web of trust is non-hierarchical and decentralized (from an organizational point of view). That unfortunately makes it both more complex and less predictable, which I suppose is why it “lost” (not that it’s actually gone, but I personally have about one or maybe two trusted, non-expired keys left in my keyring).
With long and substantive comments, sure, you can usually tell, though much less so now than a year or two ago. With short, 1 to 2 sentence comments though? I think LLMs are good enough to pass as humans by now.
[0] https://xkcd.com/810/
Buried at the bottom of the thread was a helpful reply by an obvious LLM account that answered the original question far better than any of the other comments.
I'm still not sure if that's amazing or terrifying.
https://en.wikipedia.org/wiki/GPT4-Chan
> Hideo Kojima's ambitious script in Metal Gear Solid 2 has been praised, some calling it the first example of a postmodern video game, while others have argued that it anticipated concepts such as post-truth politics, fake news, echo chambers and alternative facts.
[0] https://en.wikipedia.org/wiki/Metal_Gear
[1] https://en.wikipedia.org/wiki/Dead_Internet_theory
Perhaps I am jaded, but most if not all people regurgitate about topics without thought or reason along very predictable paths, myself very much included. You can mention a single word covered with a muleta (the Spanish bullfighter's cape) and the average person will happily run at it and give you a predictable response.
I see the exact same in others. There are some HN usernames that I have memorized because they show up deterministically in these threads. Some are so determined it seems like a dedicated PR team, but I know better...
Even if someone managed to keep a few AI-driven accounts alive, the marginal cost is high. Running inference on dozens of fresh threads 24/7 isn’t free, and keeping the output from slipping into generic SEO sludge is surprisingly hard. (Ask anyone who’s tried to use ChatGPT to farm karma—it reeks after a couple of posts.) Meanwhile the payoff is basically zero: you can’t monetize HN traffic, and karma is a lousy currency for bot-herders.
Could we stop a determined bad actor with resources? Probably, but the countermeasures would look the same as they do now: aggressive rate-limits, harsher newbie caps, human mod review, maybe some stylometry. That’s annoying for legit newcomers but not fatal. At the end of the day HN survives because humans here actually want to read other humans. As soon as commenters start sounding like a stochastic parrot, readers will tune out or flag, and the bots will be talking to themselves.
Written by GPT-3o
- ID/proof of human verification. Scan your ID, give me your phone number, rotate your head around while holding up a piece of paper, etc. Note that some sites already do this by proxy when they whitelist like 5 big email providers they accept for a new account.
- Going invite only. Self explanatory and works quite well to prevent spam, but limits growth. lobste.rs and private trackers come to mind as an example.
- Playing whack-a-mole with spammers (and losing eventually). 4chan does this by requiring you to solve a captcha and to pass the Cloudflare Turnstile, which may or may not do some browser fingerprinting/bot detection. CF is probably pretty good at deanonymizing you through this process too.
All options sound pretty grim to me. I'm not looking forward to the AI spam era of the internet.
You'd need a permanent captcha that tracks whether the actions you perform are human-like, such as mouse movement or scrolling on a phone, etc. And even then it would only deter current AI bots, and not for long, as impersonating human behavior would be a 'fun' challenge to break.
Trusted relationships are only as trustworthy as the humans trusting each other, eventually someone would break that trust and afterwards it would be bots trusting bots.
Due to bots already filling up social media with their spew and that being used for training other bots the only way I see this resolving itself is by eventually everything becoming nonsensical and I predict we aren't that far from it happening. AI will eat itself.
Correct. But for curbing AI slop comments this is enough imo. As of writing this, you can quite easily spot LLM generated comments and ban them. If you have a verification system in place then you banned the human too, meaning you put a stop to their spamming.
See also my other comment on the same parent wrt network of trust. That could perhaps vet out spammers and trolls. On one hand it seems far-fetched and a quite underdeveloped idea; on the other hand, social interaction (including discussions like these) as we know it is in serious danger.
Of course this goes against the interests of tracking/spying industry and increasingly authoritarian governments, so it's unlikely to ever happen.
The weak link is in the ID servers themselves. What happens if the servers go down, or if they refuse to issue keys? Think a government ID server refusing to issue keys for a specific person. Pages that only accept keys from these government ID servers, or that are forced to only accept those keys, would be inaccessible to these people. The right to ID would have to be enshrined into law.
This verification mechanism must include some sort of UUID to rein in a single bad actor who happens to validate his/her bot farm of 10000 accounts from the same certificate.
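A per-credential cap is simple to sketch, assuming the ID server hands each site an opaque per-person UUID; the class and the limit of 3 below are hypothetical:

```python
from collections import defaultdict

class AccountRegistry:
    """Cap how many site accounts a single verified credential may create."""

    def __init__(self, max_per_credential=3):
        self.max_per_credential = max_per_credential
        self.accounts = defaultdict(set)  # credential UUID -> account names

    def register(self, credential_uuid, account):
        owned = self.accounts[credential_uuid]
        if account in owned:
            return True  # re-registering an existing account is a no-op
        if len(owned) >= self.max_per_credential:
            return False  # credential exhausted: likely a bot farm
        owned.add(account)
        return True
```

The hard part isn't this bookkeeping, of course; it's issuing UUIDs that are stable per person yet unlinkable across sites.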
https://support.apple.com/en-us/102591
https://blog.cloudflare.com/eliminating-captchas-on-iphones-...
Just in case I need to check for plagiarism.
I don't have enough Vram nor enough time to do anything useful on my personal computer. And yes I wrote vram like that to pothole any EE.
I'm actually surprised at that volume, given this is a text-only site. Humans have managed to post over 20 billion bytes of text to it over the 18 years that HN existed? That averages to around 3MB per day, or roughly 35 bytes per second.
Also, I bet a decent amount of that is not from humans. /newest is full of bot spam.
Call it "Go", for example.
(Necessary disclaimer for the irony-impaired: this is a joke and an attempt at being witty.)
I'd imagine these days TypeScript or Node might be taking over some of what would have hit on JavaScript.
But there is likely some decline for Java. I'd bet Elixir and Erlang have been nibbling away on the JVM space for quite some time, they make it pretty comfortable to build the kind of systems you'd otherwise use a JVM-JMS-Wildfly/JBoss rig for. Oracle doesn't help, they take zero issue with being widely perceived as nasty and it takes a bit of courage and knowledge to manage to avoid getting a call from them at your inconvenience.
I like Tableau Public, because it allows for interactivity and exploration, but it can't handle this many rows of data.
Is there a good tool for making charts directly from Clickhouse data?
I used a function based on the age for staleness, it considers things stale after a minute or two initially and immutable after about two weeks old.
https://github.com/jasonthorsness/unlurker/blob/main/hn/core...
Edit: or make a non-stacked version?
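A sketch of such an age-based policy, using the thresholds from the comment above ("a minute or two" initially, immutable after about two weeks); the actual function in the linked repo will differ:

```python
TWO_WEEKS = 14 * 24 * 3600  # seconds

def stale_after(item_age_seconds):
    """Seconds a cached copy stays valid, as a function of the item's age.

    New items are revalidated after ~90s; the window grows linearly with
    age up to about a day, and items older than two weeks never expire."""
    if item_age_seconds >= TWO_WEEKS:
        return float("inf")  # old items are treated as immutable
    return 90 + (86400 - 90) * (item_age_seconds / TWO_WEEKS)
```

The intuition: HN threads see nearly all their edits, votes, and replies in the first days, so aggressively revalidating old items buys nothing.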
[1]: https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PUBNOTES/ATL-...
The area occupied by each color is basically meaningless, though, because of the logarithmic y-scale. It always looks like there's way more of whatever you put on the bottom. And obviously you can grow it without bound: if you move the lower y-limit to 1e-20 you'll have the whole plot dominated by whatever is on the bottom.
For the record I think it's a terrible convention, it just somehow became standard in some fields.
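The effect is easy to quantify: on a log axis, a band's vertical extent is log(y1) − log(y0), so the bottom band's apparent size grows without bound as the lower axis limit shrinks, while bands higher up are unaffected:

```python
import math

def band_height(y0, y1):
    """Vertical extent of the band from y0 to y1 on a log10 axis, in decades."""
    return math.log10(y1) - math.log10(y0)

# Same data, different lower axis limit: the bottom band's share explodes.
print(band_height(1e-6, 1e-3))   # bottom band, axis floor at 1e-6
print(band_height(1e-20, 1e-3))  # bottom band, axis floor at 1e-20
print(band_height(1e-3, 1.0))    # top band: unchanged either way
```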
Shouldn't that be The Fall Of Rust? According to this, it saw the most attention during the years before it was created!